Data center migrations aren’t a regular thing for most employees. Depending on the kinds of companies you work for and roles you hold, you might participate in a handful of major migrations over the course of a career. Before starting a data center migration, you need to figure out where you are going, what’s going there, and how to get it there.
These are some common pitfalls to avoid, mistakes others have made, tips to keep in mind, and general do's and don'ts of data center migration (20 of them).
1 – Lacking a complete infrastructure assessment / not knowing what you have – To plan and execute a move successfully, you must first know the details of everything running in your current data center. You need to know how many workloads are physical, how many are already virtual, how many use local storage, and how many rely on shared storage. Then you must decide how many will work the same way once moved, and how many will be handled differently (e.g. P2V'd, moved to shared storage, etc.) in the new environment, so you can correctly calculate the new infrastructure purchases and costs.
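That assessment ultimately boils down to counting workloads by category and by migration plan. A minimal sketch of that tally, using a hypothetical inventory list and field names of my own invention:

```python
# Sketch: summarize a workload inventory to drive migration planning.
# The inventory records and field names below are hypothetical examples.
from collections import Counter

inventory = [
    {"name": "db01",  "type": "physical", "storage": "local",  "plan": "p2v"},
    {"name": "web01", "type": "virtual",  "storage": "shared", "plan": "as-is"},
    {"name": "app01", "type": "physical", "storage": "local",  "plan": "as-is"},
]

by_type    = Counter(w["type"] for w in inventory)
by_storage = Counter(w["storage"] for w in inventory)
by_plan    = Counter(w["plan"] for w in inventory)

print(f"{by_type['physical']} physical / {by_type['virtual']} virtual")
print(f"{by_storage['local']} local storage / {by_storage['shared']} shared")
print(f"{by_plan['p2v']} to be P2V'd, {by_plan['as-is']} moving as-is")
```

In practice the inventory would come from a CMDB or discovery tool export rather than a hand-written list, but the counts that feed the purchasing math are the same.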
2 – Unclear Leadership – A project manager is a necessity, not a nice-to-have. Someone must have overall responsibility for the move: authorizing next steps, communicating progress to leadership, and, if all else fails, calling for a cancellation and backing out changes. Without a project manager, disagreements can arise over whether the move is able to continue forward successfully, ultimately resulting in a lot of unnecessary and costly downtime that could easily have been avoided if one person had the authority to make the call.
3 – Not recognizing dependencies – From simple things like two redundant power supplies being connected to the same UPS, to multiple applications relying on a database that staff thought was used by only one, understanding dependencies at all levels is extremely important to planning a successful move. If that UPS were taken down before one of the power supplies was moved to a separate power source, the server would fail; and if the database in question were taken down along with the single application that staff initially thought was using it, other application failures would follow. A complete and up-to-date application dependency map is an absolute must when planning a successful data center move.
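The database scenario above is really a reverse-dependency lookup: before taking anything down, walk the dependency map and list everything that directly or transitively relies on it. A minimal sketch, with made-up component names:

```python
# Sketch: find everything that (transitively) depends on a component
# before taking it down. Component names are hypothetical.
deps = {
    "billing-app":   ["orders-db"],
    "reporting-app": ["orders-db"],   # the dependency staff forgot about
    "orders-db":     ["san-volume1"],
}

def dependents(target, dep_map):
    """Return all components that directly or indirectly rely on target."""
    reverse = {}
    for app, needs in dep_map.items():
        for need in needs:
            reverse.setdefault(need, set()).add(app)
    found, stack = set(), [target]
    while stack:
        for d in reverse.get(stack.pop(), ()):
            if d not in found:
                found.add(d)
                stack.append(d)
    return found

print(dependents("orders-db", deps))  # both apps show up, not just one
```

Running this for "san-volume1" would surface the database and both applications, which is exactly the chain of failures the move plan has to sequence around.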
4 – Not writing a step by step procedure – When the middle of the night comes around, or staff is well into their second day without sleep, judgment begins to lapse and the temptation to take shortcuts becomes ever stronger. A step by step procedure detailing exactly what to do next, including commands and definitions of what constitutes success, is key to the success of a long move. Good, thorough, detailed pre-planning that includes the necessary steps and commands spares tired staff from having to make (bad) decisions, and allows a move that might otherwise fail to continue through to success.
5 – Not verifying equipment will fit in the new space – Do a trial run to ensure any large equipment will clear tight corners, narrow openings, and hallways, and that elevators (are they working?) can handle the weight of the equipment you plan to move. Ordering before measuring, or before confirming that it's physically possible to move the hardware (racks, specifically) into your new space, can prove a costly mistake. Always measure before ordering, and if it's close, order one unit ahead of time and test that the questionable equipment can be moved in well before the migration.
6 – Underestimating the time necessary to complete the work – On paper, transferring 10GB of data over a 10 Gbps link should take 8 seconds; that assumes you have 100% of the link's bandwidth dedicated to the transfer, and that the transfer has no overhead and 100% efficiency. Some of these assumptions are never true (there is always overhead), and others are unlikely to hold when the time comes to execute; the chance of a link carrying no other traffic is slim to none if others are using it too. Staff, too, get tired, and the time to fill and cable one rack's worth of equipment may be a lot less than the time required for the same staff members to rack and cable the 11th rack's worth. Be extremely careful with pre-move calculations, always erring on the side of caution and budgeting extra time where feasible to allow for variations in things outside of your control. Underestimating could result in what is otherwise a successful move being viewed in the eyes of management as a move that took too long and caused too much extra downtime.
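The 8-second paper math versus reality can be made concrete with a couple of derating factors. The efficiency and link-share figures below are illustrative assumptions, not measured values:

```python
# Sketch: a more pessimistic transfer-time estimate. The efficiency and
# contention factors are illustrative assumptions, not measured values.
def transfer_hours(data_gb, link_gbps, efficiency=0.7, our_share=0.5):
    """Estimate hours to move data_gb gigabytes over a link_gbps link.

    efficiency: protocol/overhead factor (TCP, framing, disk speed)
    our_share:  fraction of the link we realistically get to use
    """
    effective_gbps = link_gbps * efficiency * our_share
    seconds = (data_gb * 8) / effective_gbps
    return seconds / 3600

# Ideal math says 10 GB over 10 Gbps takes 8 s; with these assumptions:
print(f"{transfer_hours(10, 10) * 3600:.1f} s")  # ~22.9 s, nearly 3x the ideal
print(f"{transfer_hours(50_000, 10):.1f} h")     # 50 TB: ~31.7 hours
```

The exact factors matter less than the habit of applying them: a 2-3x derate on paper numbers is a reasonable starting point when nothing has been benchmarked.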
7 – Lack of a test plan – Lack of a test plan can cause entire moves to fail. A test plan should detail the definition of success, how each step will be validated, and who will sign off on each test’s success. This ensures that the “blame game” is avoided, and that when time comes for staff to return to work on Monday, applications are available, performance is within acceptable ranges, and things are otherwise working as expected.
8 – Lack of a back out procedure – A back out procedure is a must, and should be incorporated into the same plan that will be followed during the move. It may seem like a lot of extra work up front, but should it become necessary to back out and postpone or abandon a move, both the staff performing the work and executives will be glad that it was prepared. It is also critical that the plan detail exactly what sorts of failures will trigger the back out plan, and exactly who has the authority to determine that a failure has occurred and to put the back out plan into action.
9 – Assuming you are “done” with your piece without thoroughly verifying – Take for example a database-driven application. The database team member brings up the new database servers, restores the database dumps to them, and tests them by connecting and executing a few queries; all the data appears to be there, and the queries look good. He lets everyone know, shuts off his phone, and goes to bed. Not long after, the application team gets the all clear that their servers are ready, sets up the application, and goes to run it, only to find it can't query the database. The new database passwords were never communicated, the database team is now unreachable, and this phase of the migration is delayed until morning. The definition of “done” for each step must be carefully thought out to avoid situations like this!
10 – Lack of / Bad communication and not coordinating with others – Communication can affect a migration's success in many ways. Prematurely executing migration steps before others are ready can cause work to be duplicated, and conversely, not communicating that you've finished can leave others waiting around to do their parts, lengthening the migration and any associated downtime. Providing executives a dashboard or another agreed-upon means to keep tabs on progress, updating application owners, and communicating with users are all important as well.
11 – Ordering the wrong server racks – As the computing industry has evolved, so has the drive to increase compute density in every way possible. Be sure before you order 55 RU racks that your facility has the ceiling height to fit them, the floor strength to hold them, the power density to power them, and the cooling capacity to cool them! Any single one of these factors being overlooked can be a show stopper, causing larger racks to remain underutilized, resulting in wasted capital outlay, or worse, having to return them if they don’t even fit in the building.
12 – Ordering the wrong server form factor – Be extremely attentive when ordering hardware for a migration. Not only will mistakes like the wrong form factor (like non-rackable!), or the wrong rail kits really mess up a migration, but these mistakes can also be extremely costly to correct even if caught before move day. The logistics and cost alone involved in repackaging, addressing, reloading, and shipping back hundreds of servers, switches, or even rail kits can put a real dent in the perceived success of a migration.
13 – Not ordering enough racks or hardware – Calculating the number of racks and servers required for a migration may seem straightforward, but it involves many variables. The rack size the facility can support (as discussed prior) affects the number of racks necessary, and the computing power and physical size of each server dictate how many servers and racks must be ordered. Getting these numbers right means accounting not only for the capacity required to run today's data center, but also for growth: hardware that will be migrated physical to virtual, memory and processing requirements, the application density your organization is comfortable running on each server, and, on top of all that, peak load and failover capacity. For a large move, a slight miscalculation in any one of these factors can leave capacity severely under- or over-purchased, at worst resulting in a completely failed move, and at best in a lot of extra spend and recurring costs for idle hardware. Though “measure twice, cut once” doesn't exactly apply, the principle behind it surely does!
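A back-of-the-envelope version of that sizing math might look like the following. All the input figures (VM counts, growth rate, consolidation ratio, rack units) are hypothetical planning numbers, not recommendations:

```python
import math

# Sketch: rough server and rack counts. All inputs are hypothetical
# planning figures for illustration only.
def servers_needed(current_vms, vms_per_server, growth=0.3, failover_spares=2):
    """Size for current load plus growth, then add N+2 failover headroom."""
    base = math.ceil(current_vms * (1 + growth) / vms_per_server)
    return base + failover_spares

def racks_needed(servers, ru_per_server, usable_ru_per_rack):
    per_rack = usable_ru_per_rack // ru_per_server
    return math.ceil(servers / per_rack)

n = servers_needed(current_vms=400, vms_per_server=25)  # ceil(520/25)+2 = 23
print(n, "servers,", racks_needed(n, ru_per_server=2, usable_ru_per_rack=40), "racks")
```

The point of writing it down as a formula, rather than eyeballing it, is that each assumption (growth, density, spares) becomes an explicit number someone can review and sign off on.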
14 – Not taking power availability into account – This is the flip side of the coin to selecting rack size, as full racks use a lot of power, and the larger the rack and the more powerful the hardware, the more power needed. The type of hardware to be racked is also a big factor. Idle servers will use a lot less power than fully loaded servers, and a rack full of spinning disks is going to use a lot more power than a rack full of SSD storage, the latter being much more power efficient. When calculating for power, rack size, what is to be racked, and the power requirements of each item at full load must be taken into account. Running power vs. bootstrap (startup) power must also be considered. Power draw at startup can be several times greater than power consumed while running or at idle (sometimes as high as, or higher than, full-load draw!), so machines must be powered up in batches that take this into account to avoid overloading a circuit. Finally, don't forget to properly size UPS solutions to support the denser, more power-hungry loads that larger racks with smaller, more powerful hardware can require!
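Batching the power-up sequence against a circuit limit is simple to plan ahead of time. A sketch, using made-up per-server startup wattage and a hypothetical circuit limit:

```python
# Sketch: group machines into power-up batches so combined startup draw
# stays under a circuit limit. Wattage and limit figures are hypothetical.
def powerup_batches(startup_watts, circuit_limit_watts):
    """Greedily pack (name, watts) pairs into batches under the limit."""
    batches, current, used = [], [], 0
    for name, watts in startup_watts:
        if used + watts > circuit_limit_watts:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += watts
    if current:
        batches.append(current)
    return batches

servers = [(f"srv{i:02d}", 800) for i in range(10)]  # assume 800 W startup draw each
print(powerup_batches(servers, circuit_limit_watts=3000))  # batches of 3, 3, 3, 1
```

In a real runbook each batch would also get a settle time (wait for inrush to subside and machines to reach steady state) before the next batch is switched on.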
15 – Under or overestimating cooling needs – Cooling needs are often thought of as a building requirement, and can be an afterthought in the minds of those planning a migration. Unfortunately, it will be at the forefront of your mind if you realize that you don't have enough cooling to support the loads you've migrated in! In a worst-case scenario, this could mean having to power off hardware to lessen the cooling load, or even halting a migration in its tracks until larger units can be installed — assuming they can be! Install too much cooling, on the flip side, and you will be stuck paying for oversized cooling units that cycle too quickly, shortening the equipment's lifespan while never operating at peak efficiency. All this can be avoided with careful calculation and analysis ahead of move time!
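The core of that calculation is converting projected IT load into heat and then into cooling capacity. A minimal sketch using the standard watts-to-BTU conversion; the 20% non-IT overhead factor is an assumption for illustration:

```python
# Sketch: convert projected IT load into a cooling requirement.
# 1 W of IT load produces about 3.412 BTU/hr of heat;
# 1 ton of cooling = 12,000 BTU/hr.
def cooling_tons(it_load_watts, overhead_factor=1.2):
    """overhead_factor covers lighting, people, UPS losses (assumed 20%)."""
    btu_per_hour = it_load_watts * 3.412 * overhead_factor
    return btu_per_hour / 12_000

print(f"{cooling_tons(50_000):.1f} tons for a 50 kW IT load")  # ~17.1 tons
```

Sizing against the post-migration full load (with that number reviewed by whoever owns the facility) is what keeps cooling from becoming the afterthought described above.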
16 – Migrating to a hosting provider without testing their claims (e.g. not enough bandwidth to serve all customers, not enough compute to serve peak load) – Make sure that you've vetted your hosting provider before going all in with them. Talk to their other customers if possible, and don't spring for the lowest-cost provider based on dollars alone, or you may be in for a very hard lesson. Some hosting providers have been known to oversell their capacity, and at peak times all customers' workloads are affected and can perform poorly, or not at all. Ensure that if a data center promises you a 1 Gbps internet pipe, they aren't selling that same 1 Gbps to three other customers! Ask the right questions, and do due diligence ahead of time before making the jump.
17 – Having no plan for what to do with hardware from the old data center – Getting rid of old hardware can actually turn out to be a significant cost. Many companies specialize in doing just that, as stringent requirements must be followed if your organization handles certain kinds of sensitive data such as medical records, financial data, or credit card information. If the hardware isn't that old, on the flip side of the coin, there can be significant cost savings in re-selling it. The used hardware market is huge; Dell operates an entire line of business selling used and refurbished hardware. The bottom line is to think about what you will do with old hardware after you turn it off, because if you haven't made any plans, you'll be stuck paying data center rates to store it until you do!
18 – Migrating to a new technology based on marketing material & buzz over true business need — and dealing with all the issues that come with new tech (instability, headaches, downtime) – Avoid being the “new tech guinea pig.” Do everything in your power to demonstrate to stakeholders the risks involved in migrating to new technology “X” if you know for certain it will do more harm than good, or that in the end it won't provide benefit to the company even if it works as advertised. It's best to avoid this situation altogether, but sometimes it isn't your choice. If there is still pushback, at the very least try to work out an arrangement to do a partial, trial migration to the new technology before going all in.
19 – Rushing to migrate workloads to the cloud without having the in-house expertise to manage cloud workloads efficiently – Migrating workloads to the cloud is only one step of the process, and you must be sure that you have the expertise to accomplish it. Once everything has been migrated, however, you must also be sure that you have the on-staff expertise to manage cloud workloads, as the methodologies behind monitoring them can differ greatly from an onsite data center: you no longer have the ability to walk into your server room and move wires, or hook up a keyboard and monitor to an unresponsive, misbehaving, mission-critical server.
20 – Not taking the time to move the actual data into consideration – i.e. underestimating the bandwidth of a “station wagon full of hard disks / tapes” – Moving terabytes of data takes a significant amount of time and bandwidth. How this will be accomplished, and the timings, must be worked out ahead of time, especially if getting a copy of the data requires server downtime, as is the case with dumping a MySQL database; with larger databases, this downtime can be very significant. Finally, plan for testing the integrity of the data once it arrives, being sure to know who can and will test and sign off on it. All of this requires pre-move planning, and it may well be decided that the best way to move the data is a trunkload of high-capacity hard disks or tapes.
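Whether the wire or the station wagon wins is a calculation worth doing before move day. A sketch of the comparison; the effective bandwidth, copy times, and transit time are all assumptions to be replaced with your own numbers:

```python
# Sketch: compare moving data over the network vs. shipping disks/tapes.
# Effective bandwidth, copy, and transit figures are assumptions.
def network_days(data_tb, effective_gbps):
    """Days to push data_tb terabytes at a given effective line rate."""
    return (data_tb * 8_000) / effective_gbps / 86_400

def sneakernet_days(copy_out_hours, transit_days, copy_in_hours):
    """Days to copy onto media, drive/ship it, and copy off again."""
    return (copy_out_hours + copy_in_hours) / 24 + transit_days

# 500 TB over an effective 1 Gbps vs. a two-day drive with a day of
# copying on each end:
print(f"network:    {network_days(500, 1):.1f} days")        # ~46.3 days
print(f"sneakernet: {sneakernet_days(24, 2, 24):.1f} days")  # 4.0 days
```

At these (assumed) numbers the trunkload of disks wins by an order of magnitude, which is exactly the point of the old “never underestimate the bandwidth of a station wagon” adage.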