By Francis Miers, Director, Automation Consultants
TSB’s data migration crisis in April ranks among any business’s worst nightmares.
The process of transferring customer data from an IT platform run by its former owner, Lloyds TSB, to one run by its new owner, Sabadell, became a series of unfortunate events which, media reports at the time suggested, compromised customer privacy, disrupted customer money transfers and saw some customers lose money as a result of fraud.
The consequences for the bank were severe: £200 million in costs, a loss of customers put at 12,500, and incalculable reputational damage. Its CEO of seven years, Paul Pester, was forced out in September.
The stakes are high when moving critical enterprise IT systems, but such a move need not end in crisis, and the broad principles governing a successful migration are similar whatever the scale.
The two main things to avoid in a migration are an unplanned outage of the service for users and loss of data, either in the sense that unauthorised users have access to data, or in the sense that data is destroyed.
In most cases, outages cannot be justified in business hours, so migrations must typically take place within the limited timeframe of a weekend. To be sure that a migration over a weekend will run smoothly, it is normally necessary to perform one or more trial migrations in non-production environments, that is, migrations to a copy of the live system which is not used by or accessible to real users. The trial migration will expose any problems with the migration process, and these problems can be fixed without any risk of affecting the service to users.
Once the trial migration is complete, has been tested and any problems with it fixed, the live migration can be attempted. For a system of any complexity, the go-live weekend must be carefully pre-planned hour by hour, ensuring that all the correct people are available and know their roles. As part of the plan, a rollback plan should be put in place. The rollback plan is a planned, rapid way to return to the old system in case anything should go wrong during the live migration. One hopes not to have to use it because the live migration should not normally be attempted unless there has been a successful trial migration and the team is confident that all the problems have been ironed out.
On the go-live weekend, the live system is taken offline, and a period of intense, often round-the-clock, activity begins, following the previously made plan. At a certain point, while there is still time to trigger the rollback plan, a meeting will be held to decide whether to go live with the migration or not (a “go / no go” meeting). If the migration work has gone well, and the migrated system is passing basic tests (there is no time at that point for full testing; full testing should have been done on the trial migration), the decision will be to go live. If not, the rollback plan will be triggered and the system returned to the state it was in before the go-live weekend.
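The logic of the go / no go meeting can be reduced to a very small sketch. The Python below is illustrative only: the function name and the test names are hypothetical, and in practice the decision is made by people weighing more than a list of pass/fail results.

```python
def go_no_go(smoke_results):
    """Decide between going live and rolling back.

    smoke_results maps a basic test name to True (passed) or False (failed).
    Returns a (decision, reasons) pair.
    """
    # Assumption: with no evidence at all, the safe choice is to roll back.
    if not smoke_results:
        return "rollback", ["no smoke tests were run"]

    failed = sorted(name for name, ok in smoke_results.items() if not ok)
    decision = "go" if not failed else "rollback"
    return decision, failed


# Hypothetical results gathered shortly before the meeting.
results = {"login": True, "balance enquiry": True, "payment transfer": False}
decision, reasons = go_no_go(results)
print(decision, reasons)
```

The point of writing the rule down in advance, in whatever form, is that the criteria are fixed before the pressure of the weekend begins.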
If the task of migration is so great that it is difficult to fit it into a weekend, even with very good planning and preparation, it may be necessary to break it into phases. The data or applications are broken down into groups which are migrated separately. This approach reduces the complexity of each group migration compared to one big one, but it also has disadvantages. If the data or applications are interdependent, it may cause performance or other technical problems if some are migrated while others remain, especially if the source and destination are physically far apart. A phased migration will also normally take longer than a single large migration, which will add cost; and it will be necessary to run two data centres in parallel for an extended period, which may add further cost. In TSB’s case, it may have been possible to migrate the customers across in groups, but it is hard to be sure without knowing its systems in detail.
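One way to think about choosing the groups for a phased migration is to keep interdependent applications together, so that no group is split between the old and new data centres. The sketch below assumes the dependencies are known and can be written as pairs; the application names are invented, and real dependency maps are rarely this tidy.

```python
from collections import defaultdict


def migration_groups(apps, dependencies):
    """Group applications so that anything linked by a dependency moves together.

    apps is a list of application names; dependencies is a list of (a, b)
    pairs meaning a and b talk to each other. Returns a list of groups.
    """
    graph = defaultdict(set)
    for a, b in dependencies:
        graph[a].add(b)
        graph[b].add(a)

    seen, groups = set(), []
    for app in apps:
        if app in seen:
            continue
        # Walk everything reachable from this application.
        stack, group = [app], []
        seen.add(app)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


# Hypothetical estate: two linked pairs and one standalone application.
apps = ["payments", "ledger", "crm", "mailer", "reports"]
links = [("payments", "ledger"), ("crm", "mailer")]
print(migration_groups(apps, links))
```

If everything turns out to depend on everything else, the function returns a single group, which is itself a useful finding: a phased approach would then mean splitting interdependent systems across sites.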
What can go wrong?
Migrations can be expensive because it can take a great deal of time to plan and perform the trial migration(s). With complex migrations, several trial migrations may be necessary before all the problems are ironed out. If the timing of the go-live weekend is tight, which is very likely in a complex migration, it will be necessary to stage some timed trial migrations, or “dress rehearsals”. Dress rehearsals are to ensure that all the activities required for the go-live can be performed within the timeframe of a weekend.
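The arithmetic behind a dress rehearsal is simple, and a sketch makes it concrete. The step names, durations and window below are invented for illustration, and the sketch assumes that time for a rollback must be held in reserve inside the same window.

```python
def fits_window(step_hours, window_hours, rollback_hours):
    """Check whether the rehearsed steps fit the outage window.

    step_hours maps each step to the hours it took in the dress rehearsal.
    Returns (fits, slack_hours); negative slack means the plan overruns.
    """
    total = sum(step_hours.values())
    slack = window_hours - rollback_hours - total
    return slack >= 0, slack


# Hypothetical timings from a dress rehearsal.
rehearsal = {"export": 6, "transfer": 10, "import": 14, "smoke tests": 4}

# Assumed window: Friday 20:00 to Monday 08:00 is 60 hours, with 12 held back for rollback.
print(fits_window(rehearsal, window_hours=60, rollback_hours=12))
```

A plan that only fits with little or no slack is a signal to rehearse again, shorten a step or consider phasing.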
Trial migrations should be tested. In other words, once a trial migration has been performed, the migrated system, which will be hosted in a non-production environment, should be tested. The larger and more complex the migration, the greater the requirement for testing. Testing should include functional testing, user acceptance testing and performance testing.
Functional testing of a migration is somewhat different from functional testing of a newly developed piece of software. In a migration, the code itself may be unchanged, and if so there is little value in testing code which is known to work. Instead, it is important to focus the testing on the points of change between the source environment and the target. The points of change typically include the interfaces between each application and whatever other systems it connects to.
In a migration, there is often change in interface parameters used by one system to connect to another such as IP addresses, database connection strings and security credentials. The normal way to test the interfaces is to exercise whatever functionality of the application uses the interfaces. Of course, if code changes are necessary as part of a migration, the affected systems should be tested as new software.
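Before exercising the application itself, a quick reachability check of each changed interface can catch a mistyped address or a closed firewall port. The sketch below is a minimal example using only Python's standard `socket` module; the function name is hypothetical, and a successful TCP connection says nothing about credentials or connection strings, which still have to be tested by using the application.

```python
import socket


def check_endpoints(endpoints, timeout=2.0):
    """Attempt a TCP connection to each interface in the target environment.

    endpoints maps a label to a (host, port) pair.
    Returns {label: None} on success or {label: error message} on failure.
    """
    results = {}
    for label, (host, port) in endpoints.items():
        try:
            # The connection is closed again as soon as it succeeds.
            with socket.create_connection((host, port), timeout=timeout):
                results[label] = None
        except OSError as exc:
            # Covers refused connections, timeouts and name-resolution failures.
            results[label] = str(exc)
    return results
```

It would be called with the new environment's own addresses, for example a dictionary mapping "customer database" to its new host and port, and any entry that comes back with an error message points to an interface parameter that needs attention.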
In the case of TSB, the migration involved moving customer bank accounts from one banking system to another. Although both the source and target systems were mature and well-tested, they were different code bases, and it is likely that the amount of functional testing required would have approached that required for new software.
User acceptance testing is functional testing performed by users. Users know their application well and therefore have an ability to spot errors quickly, or see problems that IT professionals might miss. If users test a trial migration and express themselves satisfied, it is a good sign, but not adequate on its own because, amongst other things, a handful of user acceptance testers will not test performance.
Performance testing checks that the system will work fast enough to satisfy its requirements. In a migration the normal requirement is for there to be little or no performance degradation as a result of the migration. Performance testing is expensive because it requires a full-size simulation of the systems under test, including a full data set.
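The “little or no degradation” requirement can be expressed as a comparison between response times measured on the source system and on the trial migration. The sketch below is one possible formulation, not a standard: the 10% tolerance, the choice of median and 95th percentile, and the sample figures are all assumptions.

```python
import statistics


def p95(samples):
    """95th percentile: the last of the 19 cut points that split the data into 20 parts."""
    return statistics.quantiles(samples, n=20)[18]


def compare(baseline_ms, migrated_ms, tolerance=0.10):
    """Compare response times before and after migration.

    A measure passes if the migrated value is no more than
    `tolerance` (as a fraction) above the baseline value.
    """
    report = {}
    for label, measure in (("median", statistics.median), ("p95", p95)):
        before, after = measure(baseline_ms), measure(migrated_ms)
        report[label] = {
            "before": before,
            "after": after,
            "ok": after <= before * (1 + tolerance),
        }
    return report


# Hypothetical response times in milliseconds for the same transaction.
baseline = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
migrated = [2 * t for t in baseline]
for label, row in compare(baseline, migrated).items():
    print(label, row)
```

The 95th percentile is included because averages can hide the slow responses that users notice most. Figures like these only mean something if they are gathered under a realistic load with a full data set, which is where the cost of performance testing comes from.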
If the data is sensitive, and in TSB’s case it was, it will be necessary, at significant time and cost, to protect the data by security measures as stringent as those protecting the live data, and sometimes by anonymising the data. In the case of TSB, the IBM enquiry into what went wrong identified insufficient performance testing as one of the problems.
Where did it go wrong for TSB? The bank was attempting a very complex operation. There would have been a team of thousands drawn from internal staff, staff from IT service companies and independent contractors. Their activities would have had to be carefully co-ordinated, so that they performed the complex set of tasks in the right order to the right standard. Many of them would have been rare specialists. If one such specialist is off sick, it can block the work of hundreds of others. One can imagine that, as the project approached go-live, having been delayed several times before, the trial migrations were largely successful but not perfect.
The senior TSB management would have been faced with a dilemma of whether to accept the risks of doing the live migration without complete testing in the trials, or to postpone go-live by several weeks and report to the board another slippage, and several tens of millions of pounds of further cost overrun. They gambled and lost.
How to minimise the risk
How could TSB have done things differently? How can someone managing a migration avoid a fate similar to Paul Pester’s?
First, a migration should have senior management backing. TSB clearly had it, but with smaller migrations, it is not uncommon for the migration to be some way down senior managers’ priorities. This can lead to system administrators or other actors, whose reporting lines differ from those of the team doing the migration, frustrating key parts of the migration because their managers are not ordering them or paying them to co-operate.
Secondly, careful planning and control are essential. It hardly needs saying that it is not possible to manage a complex migration without careful planning, and those managing the migration must have an appropriate level of experience and skill. In addition, however, the planning must follow a sound basic approach that includes trial migrations, testing and rollback plans as described above. While the work is going on, close control is important. Senior management must stay close to what is happening on the ground and be able to react quickly, for example by fast-tracking authorisations, if delays or blockages occur.
Thirdly, there must be a clear policy on risk, and the policy should be stuck to. What criteria must be met for go-live? Once this has been determined, the amount of testing required can be determined. If the tests do not pass, there must be the discipline not to attempt the migration, even if it will cost much more.
Finally, in complex migrations, a phased approach should be considered.
Rules for success
In TSB’s case, the problems that occurred after the live migration were either not spotted in testing, or they were spotted but the management decided to accept the risk and go live anyway. If they were not spotted, it would indicate that testing was not comprehensive enough – IBM specifically pointed to insufficient performance testing. That could be due to a lack of experience among the key managers. If the problems were spotted in testing, it implies weak go-live criteria and/or an inappropriate risk policy. IBM also implied that TSB should have performed a phased migration.
It may be that the public will never fully know what caused TSB’s migration to go wrong, but it sounds like insufficient planning and testing were major factors. Sensitive customer data was put at risk, and customers suffered long unplanned outages, resulting in CEO Paul Pester being summoned to the Treasury select committee and the Financial Conduct Authority launching an investigation into the bank. Ultimately Pester lost his job. When migrating IT systems in the financial sector, cutting corners is dangerous. For success one needs to follow some basic principles, use the right people and be prepared to allocate sufficient time and money to planning and testing.