How we migrated a 1TB database from Heroku to AWS Aurora with (almost) no downtime

Earlier this year we decided to move the PostgreSQL database for the Vericred API from Heroku to Amazon Aurora on AWS. Aurora offers higher performance, better scaling options, and more flexibility at a lower cost, while still being fully managed. Aurora is also PostgreSQL-compatible, so the change was seamless from the perspective of our API and our engineers. With the help of Heroku's Data Solution Architecture team we accomplished the migration, and upgraded from Postgres 9.6 to 11.6 along the way, with just a few minutes of cumulative downtime.

We decided to perform the upgrade from Postgres 9.6 to 11.6 before migrating to Aurora. We wanted to keep the number of things changing at once to a minimum, and figured a major version upgrade a couple of weeks before the migration would give us ample time to iron out any issues (in the end, there were none). The process for upgrading the major version of PostgreSQL on Heroku was straightforward: we brought up a follower database, waited for it to catch up to the primary, upgraded the follower, put the Vericred API into read-only mode, promoted the follower to be the primary, and exited read-only mode. We performed the upgrade overnight, when there was minimal activity on the API. The API was unavailable for less than five minutes while the follower was promoted and the Heroku dynos were recycled.

At a high level, the process for migrating from Heroku to Amazon Aurora was similar to upgrading major versions within Heroku: create a new database to follow the one you want to migrate, put your API into read-only mode so the follower catches up with any new commits, promote the follower, exit read-only mode, and go about your day. But a couple of issues complicated matters. Heroku runs on AWS hardware, but the database instance belongs to Heroku's AWS account, not Vericred's, and Heroku couldn't just give us carte blanche access to the instance. The other issue was that you can't configure an Aurora instance to follow a non-Aurora instance; you have to use something like logical replication or AWS Database Migration Service to get the data into Aurora.

To solve the first problem, that we didn't have full control over the source database in AWS, we needed the help of Heroku's Data Solution Architecture team. They set up a secure S3 bucket containing a recent base backup of our database and up-to-date WAL files. In our own AWS account we set up an EC2 instance and installed PostgreSQL and a tool called WAL-E, which lets you set up continuous archiving from a base backup plus WAL files. The WAL-E documentation was great and setting up continuous archiving was fairly simple. It took several hours for WAL-E to catch up to the primary database, but once it had, we had a read-only follower in AWS of our primary database in Heroku. As with the major version upgrade, we put our API into read-only mode, made the follower our primary database, and recycled our application dynos. Total downtime was less than a minute.
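For readers unfamiliar with WAL-E, a follower built this way is just a standby Postgres instance restoring WAL segments from S3. A minimal configuration for a Postgres 11 standby might look like the fragment below (the `envdir` path is an assumption based on WAL-E's documented conventions, not our exact setup):

```ini
# recovery.conf on the EC2 follower (Postgres 11; in Postgres 12+ these
# settings moved into postgresql.conf).
# The base backup is first restored with: wal-e backup-fetch "$PGDATA" LATEST
standby_mode = 'on'
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
```

With this in place, Postgres replays each WAL segment as WAL-E fetches it from the bucket, which is why catching up to the primary took several hours.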

Once the primary database was hosted in our AWS account, we had to get the data into Amazon Aurora. To do this we used AWS Database Migration Service (DMS). With DMS you define source and target databases, then define one or more tasks to move tables and rows to the target. DMS can also validate that the data in the source and target match after migration.

We needed to migrate approximately 120 tables, and they fell into one of four categories, each with its own DMS task. The first criterion was whether or not a table needed to be continuously replicated after the initial migration, which DMS supports. About 90% of our tables get updated through our API's interactions with internal applications at Vericred, but we had the luxury of temporarily turning off those services during the migration without impacting our customers. The second consideration was whether or not a table contained a LOB column. LOB (large object) columns in PostgreSQL include types like TEXT, JSON, JSONB, and BYTEA. DMS can handle LOB columns with large amounts of data using "full LOB mode," but running a task in this mode severely impacts migration performance, so it's important to segregate tables that require it into a separate task. Migrating all the tables took about four hours, the majority of which was spent on tables with JSONB columns that could be several megabytes per row.
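The two criteria above (continuous replication or not, full LOB mode or not) are what produce the four task buckets. Here is a hypothetical sketch of how the ~120 tables might be assigned to them; the bucket names, table shape, and LOB-type list are illustrative, not our actual DMS configuration:

```python
# Column types DMS treats as LOBs in PostgreSQL (illustrative subset).
LOB_TYPES = {"text", "json", "jsonb", "bytea"}

def dms_task_for(table):
    """Assign a table to one of four DMS task buckets.

    `table` is a dict with a `columns` mapping of name -> type and a
    `continuously_updated` flag (does it need ongoing replication?).
    """
    has_lob = any(col_type in LOB_TYPES for col_type in table["columns"].values())
    replicate = table["continuously_updated"]
    if replicate and has_lob:
        return "cdc-full-lob"       # ongoing replication, full LOB mode
    if replicate:
        return "cdc"                # ongoing replication, no LOB columns
    if has_lob:
        return "full-load-full-lob" # one-time load, full LOB mode
    return "full-load"              # one-time load, no LOB columns
```

Keeping the full-LOB-mode tables in their own tasks means the slow LOB handling doesn't drag down the bulk of the migration.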

DMS migrates the tables and the data inside them, but it doesn't migrate indices, primary keys, foreign keys, triggers, or user-defined types and functions. After the data migration was complete, we had to reapply all of these by running SQL scripts, prepared ahead of time, against the Aurora instance. Running all of these scripts took another two and a half hours. After that was complete, we were ready to cut over to Aurora as our primary database. Again, we put the API in read-only mode, turned off continuous replication in DMS, pointed the API at the new database, and recycled the application dynos. This took less than a minute and we were back up and running.
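One detail worth calling out is that those DDL scripts have to run in dependency order: types and functions before anything that references them, primary keys before the foreign keys that point at them, triggers last. A small, hypothetical sketch of that ordering (the category names and file layout are assumptions; the post only says the scripts were prepared ahead of time):

```python
# Dependency order for replaying DDL after a DMS full load.
# Types/functions first, then keys and indexes, then triggers.
DDL_ORDER = ["types", "functions", "primary_keys", "indexes", "foreign_keys", "triggers"]

def ordered_scripts(paths):
    """Sort script paths like 'sql/foreign_keys.sql' into dependency order."""
    def rank(path):
        for i, category in enumerate(DDL_ORDER):
            if category in path:
                return i
        return len(DDL_ORDER)  # unknown categories run last
    return sorted(paths, key=rank)
```

Each file would then be executed against Aurora in turn with your Postgres client of choice.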

The migration took a fair amount of planning, coordination, and testing by the engineering team, as well as some help from the experts at Heroku. But all in all, it turned out to be less painful than expected to migrate our 1TB database not only from one cloud provider to another, but from one database engine to another, with almost no noticeable downtime for our customers.