Unplanned outages are the worst. One quarter of all companies report that an hour of unplanned downtime now costs between $300,000 and $400,000 – and 15 percent of companies now say that an hour of unplanned downtime costs more than $5 million.
These costs are made up of many factors – lost sales and lost productivity for sure, but also the cost of overtime for server technicians, replacement parts, new cloud volumes, hastily-repurchased support licenses, and so on. There’s also a long-tail effect as consumers remember that your services were unavailable and decide not to do business with you in the future. Companies might even encounter permanent data loss, legal liability, and regulatory impacts.
Ideally, you’d have a plan for when your servers and applications glitch without warning. You’d be able to fail over to different network connections, spin up new instances, and generally ensure that your customers never even notice so much as a flicker in their connections. If you really want to postpone your next system outage, however, you’re going to need an accurate CMDB.
Why Does Downtime Take So Long to Resolve?
In spite of their terrible expense, server outages can take a great deal of time to resolve. On average, a total data center outage lasts well over two hours, whereas a partial data center outage may last 59 minutes. For the 25 percent of companies that lose between $300,000 and $400,000 per hour, that means a total data center outage will cost at least $600,000 to $800,000. Results from an Uptime Institute survey show that at least one third of data centers had an outage in 2017.
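To make the arithmetic concrete, here is a minimal sketch of the calculation. It assumes that cost accrues linearly with time, which is a simplification, and the function name `outage_cost` is ours, purely for illustration. The hourly figures and durations are the ones quoted above.

```python
def outage_cost(hourly_cost_usd: float, duration_minutes: float) -> float:
    """Estimate the cost of an outage, assuming cost accrues linearly over time."""
    return hourly_cost_usd * duration_minutes / 60.0


# Hourly cost range reported by one quarter of companies (from the text above).
HOURLY_LOW, HOURLY_HIGH = 300_000, 400_000

# A total outage, using two hours as a lower bound on "well over two hours".
total_low = outage_cost(HOURLY_LOW, 120)
total_high = outage_cost(HOURLY_HIGH, 120)

# A partial outage lasting 59 minutes.
partial_low = outage_cost(HOURLY_LOW, 59)
partial_high = outage_cost(HOURLY_HIGH, 59)

print(f"Total outage:   ${total_low:,.0f} to ${total_high:,.0f}")
print(f"Partial outage: ${partial_low:,.0f} to ${partial_high:,.0f}")
```

Under these assumptions, the two-hour case works out to $600,000 to $800,000. Since the average total outage runs longer than two hours, that range is best read as a floor.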
According to the same survey, causes of downtime were split roughly evenly between power outages, network failures, and infrastructure/software errors. Far more important is the next number – 80 percent. That’s the share of data center managers who say that the last outage they experienced was preventable.
One of the problems with administering a data center is that data centers (no surprise) are complicated. At its inception, a data center is laid out with exacting detail, but its systems become increasingly ramified as time goes on. First, an engineer adds some servers and network hardware without writing anything down. Then they leave the company. Then another engineer attempts to sort out the kludge they left behind, creating even more of a kludge. Then they leave. And so on.
In the worst-case scenario, no one currently working at the data center has a clear idea of how things are connected, so solving an outage might involve anything from using NetFlow or packet capture to physically tracing wires around your racks. Ideally, this is the kind of thing you want to do before an emergency, but like many fundamentals, a proper data center audit can be very hard to do.
Don’t Just Audit – Use an Accurate CMDB
To prevent frantic games of follow-the-wire during an outage, your first step may be to conduct a data center audit. This may be easier said than done. To begin with, as we’ve said, the data center is complicated. It might take weeks or months to do a full accounting, which means that more important or innovative projects will fall by the wayside.
In addition, the data center is much more than physical infrastructure – and it’s also not simply an operations-only shop. Under DevOps, developers may have the ability to make extensive changes in your data center infrastructure in order to support mission-critical applications. They may also spin up VMs, containerize existing applications, or connect cloud instances. What this means is that any audit of your data center is merely a snapshot, and it starts going out of date the moment it’s finished.
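To see why a snapshot goes stale, consider a minimal sketch that compares an audit snapshot against what is actually running. Everything here is hypothetical: the host names are invented and `inventory_drift` is an illustrative helper, not part of any product.

```python
from __future__ import annotations


def inventory_drift(snapshot: set[str], live: set[str]) -> dict[str, set[str]]:
    """Compare an audit snapshot with the live inventory.

    Returns the assets that appeared since the audit ("added") and the
    assets that were audited but no longer exist ("removed").
    """
    return {"added": live - snapshot, "removed": snapshot - live}


# Hypothetical asset names recorded during last quarter's audit.
audit_snapshot = {"web-01", "web-02", "db-01"}

# Hypothetical state today, after developers spun up a VM and a cloud
# instance and someone decommissioned web-02.
live_inventory = {"web-01", "db-01", "vm-build-07", "cloud-cache-01"}

drift = inventory_drift(audit_snapshot, live_inventory)
print("Added since audit:  ", sorted(drift["added"]))
print("Removed since audit:", sorted(drift["removed"]))
```

Even in this toy case, half of what is running was never audited. Without something tracking changes continuously, the only way to find that out is to audit again.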
With an accurate CMDB, audit time is reduced by orders of magnitude. Instead of taking weeks to catalogue your infrastructure and applications, you might need only hours or minutes. What’s more, the CMDB reacts to additional changes as soon as they occur – it stays constantly up to date. Lastly, an accurate CMDB can build a map of application dependencies that includes VMs, microservices, and cloud instances.
In terms of downtime, your CMDB will help you understand complicated failures in a heartbeat. You’ll be able to see not only which applications or servers are unresponsive, but also which dependent infrastructure has failed alongside them. By effortlessly tracing outages to their source, you’ll be able to shave precious time off the outage clock, potentially saving hundreds of thousands of dollars.
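As a rough illustration of what tracing an outage through a dependency map looks like, here is a minimal sketch. It assumes the map is available as a plain dictionary of component to direct dependencies. The component names and both helper functions are hypothetical, and this is not how any particular CMDB implements it.

```python
from collections import deque


def root_causes(depends_on, failed):
    """Return failed components whose direct dependencies are all healthy.

    These are the likeliest places to start: everything else that failed
    sits on top of something that is also down.
    """
    return {
        component
        for component in failed
        if not any(dep in failed for dep in depends_on.get(component, []))
    }


def impacted_by(depends_on, source):
    """Return every component that depends on `source`, directly or indirectly."""
    # Invert the map: for each dependency, which components rely on it?
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk upward from the failed component.
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical dependency map: each key depends on the components listed.
depends_on = {
    "storefront": ["checkout-api", "catalog-api"],
    "checkout-api": ["orders-db"],
    "catalog-api": ["search-index"],
    "orders-db": ["san-volume-3"],
    "search-index": [],
    "san-volume-3": [],
}

# Hypothetical monitoring result: these components are unresponsive.
failed = {"storefront", "checkout-api", "orders-db", "san-volume-3"}

print("Start looking at:", sorted(root_causes(depends_on, failed)))
print("Affected by it:  ", sorted(impacted_by(depends_on, "san-volume-3")))
```

With four components alarming at once, the map narrows the search to a single storage volume. Without it, you would be checking each one in turn while the outage clock runs.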
Here at Device42, we offer the industry’s most responsive and accurate CMDB product. If you’d like to see more about how Device42 can rescue your data center from its next outage, download our demo today!