Rehearsing migrations

Data Centre Migration Strategies – Should you rehearse?

To rehearse, or not to rehearse, that is the question[1]

Many professional services organisations, including mine, often recommend an approach where each server image migration is rehearsed some number of times before the live migration itself. (Note: I use the term “server image”, at least some of the time, because migrations increasingly feature a high percentage of virtualised systems.)

Why rehearse? Well, a lot of the time you don’t know what you don’t know. In an ideal world all systems and services would be well documented and you could use this collateral to help plan your migration actions. Even where this is true, which often it is not, the documentation is typically not geared to explaining how to actually migrate a set of related servers. So in the real world what we may end up doing is rehearsing the migration, probably breaking something in some subset of the migrated systems, working out how to fix it, fixing it and then trying again. If you were being really strict in applying this process you would stipulate that no server could be migrated live until it had gone through a “clean” rehearsal migration.

This approach, by its very nature, is conducive to a higher success rate for “final” migrations. That is to say, measured by the number of trouble-free live migrations, it is a high quality approach. Because the rehearsals are trial migrations, there is usually sufficient time between them to investigate any issues with the migrated application in depth.

So what is the downside? Well, as with most endeavours, “quality costs money”. In this case the increased cost is down to a potentially protracted migration timeframe, the use of resources for that timeframe, and multiple iterations of a simulated live cutover with all its associated management, staffing and communications overhead. Once the migration server image count gets into the hundreds and beyond, the projected costs of this approach can start to look scary. The difficulty for many data centre migration (DCM) project managers and technical specialists is defending this approach. They may be pushed to adopt a less rigorous migration approach to stay within budget.
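The strict rule mentioned above, that no server goes live until it has passed a “clean” rehearsal, could be tracked with something as simple as the following sketch. The names (`ServerImage`, `ready_for_live_migration`) and the server names are hypothetical, and it assumes “clean” means the most recent rehearsal found no issues. Your own definition may well be stricter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerImage:
    name: str
    # One entry per rehearsal, oldest first: the number of issues found.
    rehearsal_issue_counts: List[int] = field(default_factory=list)


def ready_for_live_migration(server: ServerImage) -> bool:
    """Assumed rule: at least one rehearsal, and the latest one was clean."""
    if not server.rehearsal_issue_counts:
        return False
    return server.rehearsal_issue_counts[-1] == 0


# Hypothetical inventory, for illustration only.
servers = [
    ServerImage("db01", [3, 1, 0]),  # third rehearsal was clean
    ServerImage("app01", [2, 1]),    # still finding issues
    ServerImage("web01"),            # never rehearsed
]

ready = [s.name for s in servers if ready_for_live_migration(s)]
print(ready)  # ['db01']
```

A gate like this also makes the cost trade-off visible: every server that is not yet in the “ready” list represents at least one more rehearsal cycle.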

So what is the alternative? Well, firstly you can adopt an approach of doing “one hit” migrations. With this approach the intention is that the first time you do the migration you go live with it, come what may. Typically you have already decided that you, and more importantly the business, will accept the risks associated with this approach. It assumes you will have both the time and the resources to “fix forward” any issues that arise from the migration.

The risks may be low if, for example, you have migrated this type of server and service in the recent past and feel you have a good understanding of, and handle on, any potential issues. Or it may be that the servers you are migrating can tolerate extended outages, as some types of test and development servers can. However you slice and dice it, though, the risks with this migration approach are usually higher. With the rehearsal-based approach your test teams have days or even weeks to satisfy themselves that the migrated servers and applications function as expected. With the “one hit” approach there will typically be very little time for testing, hours rather than days, and the testing scope and coverage will need careful planning and preparation to stand any chance of providing quality input to a “go/no-go” decision.

You may elect to use a hybrid approach, with “one hit” for low risk servers and rehearsals for business critical servers. One such hybrid approach that I have recently been involved with migrated the front end (stateless) web servers without a rehearsal. The DNS entries for these web servers had very short TTLs, and the mitigation was that if problems arose we would simply flip the DNS entries back to the servers in the old data centre. The backend database servers, however, were migrated more traditionally, with several rehearsals taking place, because the potential impact of problems with migrated database servers was significant.
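The hybrid approach described above amounts to a triage rule applied to each server. A minimal sketch follows, assuming the low-risk criteria are the ones mentioned in this article: tolerance of extended outages, a recent similar migration, or a stateless server with a quick way back. The field names, the `choose_approach` function and the server names are all hypothetical. In practice the decision belongs to the business, not to a script.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    tolerates_extended_outage: bool = False  # e.g. some test/dev servers
    similar_migrated_recently: bool = False  # same type of server and service
    stateless: bool = False
    quick_rollback: bool = False             # e.g. short-TTL DNS can be flipped back


def choose_approach(server: Server) -> str:
    """Return "one-hit" for servers judged low risk, otherwise "rehearsal"."""
    low_risk = (
        server.tolerates_extended_outage
        or server.similar_migrated_recently
        or (server.stateless and server.quick_rollback)
    )
    return "one-hit" if low_risk else "rehearsal"


# Hypothetical inventory, for illustration only.
inventory = [
    Server("web01", stateless=True, quick_rollback=True),
    Server("dev01", tolerates_extended_outage=True),
    Server("db01"),
]

plan = {s.name: choose_approach(s) for s in inventory}
print(plan)  # {'web01': 'one-hit', 'dev01': 'one-hit', 'db01': 'rehearsal'}
```

Defaulting every flag to `False` means a server about which nothing is known falls into the rehearsal group, which is the cautious choice.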


[1] With kind acknowledgments and apologies to William Shakespeare, Hamlet, Act 3.