Disaster recovery needs to start with a pinpoint-accurate IT inventory. If you restore from backup without restoring everything your systems depend on, you will throw a wrench in your dependencies. Worst-case scenario: the outage lasts longer and is harder to fix.
Disaster Recovery Basics
The backup-and-restore process is one of the most critical workflows in the corporate environment. This process has three important goals:
- Preserve continuity of service for customers: ideally, they’d never know that a disaster recovery event took place on your end.
- Restore productivity in the workforce: ensuring that your personnel get back to work as quickly as possible, helping to stem revenue loss due to outages or other disasters.
- Recover lost data: preventing duplication of efforts, protecting critical information, and reducing your exposure under various compliance regimes.
The average data center outage lasts more than two hours, and up to one third of data centers experience an outage in a given year (a number that is going up, not down). The average length and increasing frequency of data center outages point to a troubling trend: the process of restoring from backup isn’t actually serving the three important goals above.
IT Complexity Stymies the Backup and Restore Process
Let’s say that your data center experiences a power failure or a ransomware attack. Suppose your disaster recovery plan lets you restore your data, applications, and configuration settings as of seven days prior to the incident. In a DevOps environment, where it is very easy to make changes to VMs, cloud volumes, containers, and configuration settings, seven days is an eternity. Restoring from a full backup would be relatively easy, but it might take weeks to redo seven days’ worth of work.
Most businesses know this, so they create incremental or differential backups between full backups. In a nutshell, an incremental backup captures the files that have changed since the most recent backup of any kind, while a differential backup captures the files that have changed since the last full backup. While these are faster to create and cheaper in terms of storage space, they result in multiple backup sets that must be pieced together to make an accurate recovery.
Piecing together data from multiple backups takes longer than restoring from one large backup. There’s also more risk. If one of the incremental or differential backups is incomplete, your recovery will not be complete. In fact, this could even make the problem worse, because you won’t necessarily know what part of the recovery is missing and what is needed to fix it.
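To make the piecing-together concrete, here is a minimal Python sketch of how a restore chain might be assembled from a mix of full, incremental, and differential backups. The `BackupSet` record, the integer `day` timeline, and the `restore_chain` function are all hypothetical names invented for illustration, and the sketch assumes at most one backup per day. It is not how any particular backup product works.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupSet:
    day: int               # day the backup was taken (hypothetical integer timeline)
    kind: str              # "full", "incremental", or "differential"
    complete: bool = True  # False if the backup job failed or was cut short


def restore_chain(backups: List[BackupSet], target_day: int) -> List[BackupSet]:
    """Return the backup sets to apply, in order, to restore state as of target_day.

    Raises ValueError if there is no full backup to start from, or if any
    backup set in the chain is incomplete.
    """
    candidates = sorted(
        (b for b in backups if b.day <= target_day), key=lambda b: b.day
    )
    fulls = [b for b in candidates if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target day")
    base = fulls[-1]  # most recent full backup

    chain = [base]
    for b in candidates:
        if b.day <= base.day:
            continue
        if b.kind == "differential":
            # Holds everything changed since the full backup, so it
            # replaces whatever was layered on top of the base so far.
            chain = [base, b]
        elif b.kind == "incremental":
            # Holds only changes since the previous backup, so it is
            # added on top of everything already in the chain.
            chain.append(b)

    broken = [b.day for b in chain if not b.complete]
    if broken:
        raise ValueError(f"incomplete backup sets in chain, taken on days: {broken}")
    return chain


# Usage: one full backup on day 0, then six daily incrementals.
full = BackupSet(day=0, kind="full")
incrementals = [BackupSet(day=d, kind="incremental") for d in range(1, 7)]
print([b.day for b in restore_chain([full] + incrementals, target_day=6)])
```

With daily incrementals, all seven sets are needed and a single bad one breaks the recovery. With daily differentials, the same function would return only the full backup plus the latest differential, which is the trade-off between storage cost and restore risk described above.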
These difficulties aren’t abstract. Backup failures happen a lot. One prominent example is the 2017 GitLab outage. During that incident, spammers created a minor problem by hammering the database, followed by a database replication incident. Although these incidents were minor, they proved intractable. In trying to solve them, a developer deleted about 300 GB of production data. Despite the fact that GitLab deployed no fewer than five backup and replication techniques, none of them worked. GitLab was down for hours, and the team ended up restoring from a six-hour-old backup.
A deeper analysis of the GitLab incident shows that their efforts to back up and restore mission-critical systems were indeed foiled by complexity. Some backups failed because of a version mismatch between PostgreSQL binaries. Others failed because Azure snapshots were enabled for some servers, but not for others. The backups to S3 simply didn’t work. In the words of one GitLab team member, “The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.”
Reinforcing Backups with Inventory Tracking
Inventory tracking provides a bird’s-eye view of an organization’s software infrastructure. This includes all organizational assets, not just bare-metal servers: the VMs and containers they’re running, the applications those contain, and their operating systems, version numbers, and configuration settings. It may even include application dependency information.
Having this information stored centrally in a configuration management database (CMDB) is a godsend when it comes time to restore from backup during a disaster. If you’re trying to restore an application from backup and it’s not working, it’s possible you’re missing a dependency that needs to be restored first. Even if a backup fails or is incomplete, you’ll be able to restore your data center to full function faster, because you’ll be able to see what’s missing.
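As a rough illustration of how dependency data can guide a restore, the Python sketch below takes a dependency map of the kind an inventory might hold, works out a restore order in which dependencies come first, and flags anything that is required but absent from the backup. The `plan_restore` function, the component names, and the dictionary layout are assumptions made for this example. They do not represent a real CMDB schema or any vendor’s API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set, Tuple


def plan_restore(
    dependencies: Dict[str, Set[str]],
    target: str,
    available: Set[str],
) -> Tuple[List[str], List[str]]:
    """Plan the restore of `target` from an inventory dependency map.

    dependencies maps each component to the components it depends on.
    available is the set of components actually present in the backup.
    Returns (restore_order, missing): restore_order lists everything the
    target needs with dependencies first, and missing lists the needed
    components that the backup does not contain.
    A circular dependency raises graphlib.CycleError.
    """
    # Collect the target plus everything it depends on, directly or indirectly.
    needed: Set[str] = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(dependencies.get(name, ()))

    # TopologicalSorter expects node -> predecessors, which matches
    # "component -> things that must be restored before it".
    graph = {name: set(dependencies.get(name, ())) for name in sorted(needed)}
    restore_order = list(TopologicalSorter(graph).static_order())
    missing = sorted(needed - available)
    return restore_order, missing


# Usage with a made-up inventory: the backup lacks the cache the API needs.
inventory = {
    "web-frontend": {"orders-api"},
    "orders-api": {"postgres", "redis"},
    "postgres": set(),
    "redis": set(),
}
order, missing = plan_restore(
    inventory, "web-frontend", available={"web-frontend", "orders-api", "postgres"}
)
print("restore order:", order)
print("missing from backup:", missing)
```

Here the plan places `postgres` and `redis` ahead of `orders-api`, and reports `redis` as missing, which is the kind of gap that would otherwise show up only as an application that refuses to start after restore.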
With Device42, companies can conduct instantaneous infrastructure audits and gain an immediate understanding of their application dependencies. If you need a fast way to augment your disaster recovery efforts before your next outage, we can help you rapidly improve your posture. If you’re interested, download our 30-day free trial today!