For the second time in my career, I have survived a big migration: an old, critical piece of software, trusted for many years, was switched off so a new one could take its place.
The migration was a multi-step process that involved more than just software, but this text focuses on the design, implementation, and execution of the custom-written code that modifies or creates the data required by the new system. There are generic tools for this, but sometimes the complexity of the transformations exceeds what they can offer.
Due to the nature of the applications and the expectations of stakeholders and clients, the migration was performed offline: the system could be switched off and turned back on within the space of a weekend. Tens of millions of records were processed. This text does not discuss high-availability patterns.
Writing the code
Writing a migration requires knowing the data (or it should). Even though relational databases enforce an explicit schema (schema on write) that all data must adhere to, when a company has been in business long enough, weird things start to happen. As applications change, nuances and differences appear: a field that should not allow null values but is filled with them due to application errors, columns whose meaning has changed over time, stale data left in tables, etc. Every second spent understanding this is probably time well spent.
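One way to build that understanding is to profile the source data up front and look for exactly these anomalies. A minimal sketch in Python with SQLite (the table and column names are hypothetical, chosen for illustration only):

```python
import sqlite3

# Illustrative sketch: profile a legacy table for anomalies before writing
# the migration. The "customers" table and "email" column are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "")],  # legacy data with surprises
)

# A column that "should never be null" often is; count the violations.
nulls = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE email IS NULL OR email = ''"
).fetchone()[0]
print(f"customers with missing email: {nulls}")  # → 2
```

Running a battery of such checks against a production snapshot, before writing a single line of migration code, tends to surface the "weird things" early.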
In this case, the target system was being developed alongside the migration, and the migration consumed logic and validation that the new software already had in place. Code for the migration and the new system was being written by the same people at the same time.
One needs to be careful and make sure that the migration does not influence the new system's code in a way that makes it include assumptions and details that belong only to the migration itself. The services, modules, functions, etc. of the target system should probably know nothing about the special context in which they are being consumed, and if they do, this should be made explicit.
Software solutions, when they are complex enough, perform their work in two different realms. There is a transactional part where things happen atomically, triggered by a request from a user, another system, etc. However, responding to that request can involve operations that cannot be delivered in real time and are therefore deferred. Sending an email, querying a slow endpoint, or indexing information in a search index are examples of this. These are usually put in a queue and processed outside the context of the request.
What happens with the non-transactional part must be properly planned. You may need to skip or defer some of these steps. Ideally, the target system's functions would expose a way to configure this as part of their public API. In any case, you must know what will happen.
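A sketch of what "exposing this in the public API" could look like. Everything here is hypothetical (the function, the flag name, the in-memory queue); the point is that the deferred side effects are an explicit, controllable part of the signature rather than a hidden consequence:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the real job queue.
@dataclass
class Queue:
    jobs: list = field(default_factory=list)

    def enqueue(self, job):
        self.jobs.append(job)

queue = Queue()

def create_account(email, *, enqueue_side_effects=True):
    """Target-system function whose public API makes deferred work explicit."""
    account = {"email": email}              # transactional part
    if enqueue_side_effects:                # deferred part, opt-out for migrations
        queue.enqueue(("send_welcome_email", email))
    return account

create_account("user@example.com")                                   # normal path
create_account("migrated@example.com", enqueue_side_effects=False)   # migration path
print(len(queue.jobs))  # → 1 (the migrated account sent no email)
```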
Verifying its correctness
Migrations can put programmers in a weird position: they are often underestimated. The migration that motivated these notes was important enough that almost everyone in the company had a clear picture of what was going to happen and when, but this is not always the case. Migrations can easily be perceived as tasks that do not yield much value. In cases like these, it is hard to get recognition for a job well done but really easy to get the blame if things go wrong. Part of the duty of the person responsible is to communicate effectively to the organization the difficulties behind the scenes.
As with any other task, the code supporting the migration needs to be written in finite time and tradeoffs will be made, but one main takeaway is that a good approach to testing always pays off. If you would build a CI pipeline for any of your products, this one should be no different.
If you have been doing this long enough, it is easy to see how a new requirement handed in on the last day, apparently trivial, can cause a bug. It is also easy to see how migrations can exacerbate the severity of a defect: they usually target more rows in a single execution than the average feature does, and the amount of wrong information created can be significant. In a distributed context, if the information created is going to be reinterpreted by other services, as happens with events, the issue does not remain isolated in one place but infects many.
We wrote a test suite of, generally speaking, integration tests, especially for parts of the code where the work resembled a simple relocation of data from origin to target. For some other modules we went further, testing in complete isolation and aiming for very high coverage. We also learned that some apparently simple properties of the dataset were extremely useful for revealing problems. The most useful properties change depending on the problem, but, for example, we found that generating a summary at the end of each run, in a format that could be compared between executions, with counters for the number of records in the most important tables, caught mistakes that would have gone unnoticed otherwise. Generating this file was part of the pipeline.
The idea is that not only must the code be tested; the output data must be too.
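The run summary described above can be sketched in a few lines. This is an illustrative Python/SQLite version, not the original implementation; the key design choice is deterministic output (sorted keys, stable formatting) so that two runs can be compared with a plain diff:

```python
import json
import sqlite3

# Hypothetical source data: two "important tables" with some rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("CREATE TABLE invoices (id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(3)])

def run_summary(conn, tables):
    """Count records per table, serialized deterministically for diffing."""
    counts = {
        # Table names come from a fixed, trusted list, never from user input.
        t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
        for t in tables
    }
    return json.dumps(counts, indent=2, sort_keys=True)

summary = run_summary(conn, ["orders", "invoices"])
print(summary)
```

Writing this file at the end of every pipeline run and diffing it against the previous execution is what turns "simple properties of the dataset" into a cheap regression check on the output data itself.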
Performance
Generally speaking, performance has not been an issue I have had to spend much time on in my career. Certainly, I have had my fair share of weird things and problems, but not so many there. These ~15 years of software development have been ruled by managed or interpreted languages and by a constant increase in the computational power available to programmers. A few extra milliseconds here and there were fine most of the time if the code was easier to understand or organize.
Enterprise and web applications are usually I/O bound, and although exceptions surely exist, I feel there is a list of known problems (excessive chattiness with the DB, the N+1 problem in ORMs, etc.) and things you should not do. If you manage to stay away from them, you will probably be OK.
However, if you are going to process ten million rows, things are different. A 10 ms increase per operation adds up to more than 27 hours (10 ms × 10,000,000 rows = 100,000 s).
Measuring is vital. We used Tufte because it is application-aware: we could decide exactly which parts of the code to focus on. It can also do the same job in other environments (such as staging or production), which was extremely helpful. Environments differ heavily, and having the proper metrics is key. Questions like: what is the average time needed to process one item? What is the longest? What is causing the difference? These must be answered.
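Tufte is a Clojure profiling library, so here is a deliberately language-neutral Python sketch of the same idea: instrument exactly the code you care about, collect per-item timings, and answer the average/worst-case questions from the data rather than from intuition. The per-item work shown is a placeholder:

```python
import statistics
import time

timings = []

def process_item(item):
    """Wrap only the code we want to measure, application-aware style."""
    start = time.perf_counter()
    _ = item * 2                      # stand-in for the real per-item work
    timings.append(time.perf_counter() - start)

for item in range(1000):
    process_item(item)

# The questions from the text, answered from measurements:
print(f"mean per item: {statistics.mean(timings):.6f}s")
print(f"worst item:    {max(timings):.6f}s")
```

The same instrumentation, shipped to staging or production, is what makes the numbers comparable across environments.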
The problems we found are probably not very surprising, and they may sound trivial. However, we forget about trivial things too often.
Reading data
Indexes are additional data structures that live in the database to improve performance; they are a derived construct. A migration may need to access records in a very different (broader) way, and it is more than likely that the indexes found in the source system will not give you what you need.
All the queries performed by the migration must be reviewed to make sure each has a proper query execution plan that uses well-chosen indexes.
Writing data
One piece of advice that I think applies generally to all RDBMSs is to insert as many rows as possible at once: there are optimizations the engine can make for multi-row inserts.
In distributed environments, round trips to the database add significant overhead on top of the time needed to retrieve or write the information. Stay vigilant for excessive chattiness between the application and the database server: the same code can take twice or three times as long for cold queries in production compared to a laptop where the database instance is co-located with the software.
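Both points combine into batched writes: fewer statements, fewer round trips. A minimal sketch with Python and SQLite (the table and batch size are illustrative; the right batch size depends on row width and the server's limits):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, payload TEXT)")

rows = [(i, f"payload-{i}") for i in range(10_000)]

BATCH = 1000  # one round trip per 1000 rows instead of per row
with conn:    # single transaction around the whole load
    for start in range(0, len(rows), BATCH):
        conn.executemany(
            "INSERT INTO target VALUES (?, ?)", rows[start:start + BATCH]
        )

total = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(total)  # → 10000
```

Wrapping the batches in one transaction also avoids paying a commit per row, which on a real server is often the dominant cost.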
Reusing code
Code reuse is desirable, and leveraging code that the target system already has can speed up development. Unfortunately, using high-level code can sometimes be problematic. Already-existing functions were probably designed and written with a very different context in mind: fulfilling one request for one user. This code could be requesting more data than we need, parsing a JSON field we will not use as part of a post-processing pipeline, or, even worse, calling an unexpected external service that attempts to do something else. One would hope that the external API of the consumed code would be clear about this, but that is not always the case.
As mentioned before, this could be completely fine in the context of processing one request, but processing millions of rows is something else.
Consuming existing code can make sense if you know exactly what it does. Otherwise, make sure that queries are as raw as possible, with the fields to be sent or retrieved completely specified. Verify that the data layer is not, to give one example, automatically parsing JSON fields defined in the schema (if you do not need it).
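The contrast can be sketched as follows. Instead of going through a hypothetical data layer that eagerly hydrates every column, the migration issues a raw query naming exactly the fields it needs, and the JSON column is never touched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, attrs TEXT)")
conn.execute(
    "INSERT INTO products VALUES (1, 'widget', '{\"color\": \"red\"}')"
)

# Raw, explicit query: only the columns the migration needs.
# The attrs JSON column stays an unparsed string on the server side.
row = conn.execute("SELECT id, name FROM products WHERE id = 1").fetchone()
print(row)  # → (1, 'widget')
```

Multiplied by millions of rows, skipping an unneeded `json.loads` per record (and the bytes it would have transferred) is exactly the kind of "trivial" saving the text is about.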
Curiously enough, logging too much can also be expensive. Watch out for this.
Logging
As in any other software, logging helps to detect behaviors that can hardly be predicted, reveal unexpected delays between operations, and report the identifiers being hit. This is extremely important for monitoring the migration and for troubleshooting problems once it has finished. As discussed, migrations are a peculiar kind of task due to the broad scope they manage, and as such, effective logging is essential.
The messages (in the parts where there is human-readable text) must be well written and expressive. They should also contain whatever context is relevant to understand them (for example, the identifier of the record that has been updated, or the parent element in which a sub-modification is being made).
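A small sketch of such a message using Python's standard logging module; the function, identifiers, and the list-capturing handler (included only to make the example self-contained) are hypothetical:

```python
import logging

captured = []

class ListHandler(logging.Handler):
    """Test-only handler that collects formatted log lines in a list."""
    def emit(self, record):
        captured.append(self.format(record))

log = logging.getLogger("migration-sketch")
log.setLevel(logging.INFO)
log.addHandler(ListHandler())

def migrate_order(order_id, account_id):
    # ... transformation work would happen here ...
    # The message carries both the record id and its parent element,
    # so the line is understandable on its own after the run.
    log.info("migrated order order_id=%s account_id=%s", order_id, account_id)

migrate_order(1234, 77)
print(captured[0])  # → migrated order order_id=1234 account_id=77
```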
Granularity is key. Log less than needed, and vital information will be missing when bug reports arrive; log too much, and the noise makes it equally hard to extract any value.
In summary, the right balance is hard to find and depends, among other factors, on the expertise of the developer. It can, and should, be iteratively improved, and treated as a relevant topic from the beginning.