Switching cloud

April 08, 2022

Taken from a comment on Hacker News:

Set up HAProxy, nginx or similar as a reverse proxy and carefully decide if you can handle retries on failed queries. If you want true zero-downtime migration, there’s a challenge here in making sure you have a setup that lets you add and remove backends transparently. There are many ways of doing this, of varying complexity. I’ve tended to favour using dynamic DNS updates for this; in this specific instance we used HashiCorp’s Consul to keep DNS updated with services. I’ve also used ngx_mruby for instances where I needed more complex backend selection (it allows writing Ruby code to execute within nginx).
Set up a VPN (or more, depending on your networking setup) between the locations so that the reverse proxy can reach backends in both/all locations, and so that the backends can reach databases in both places.
Replicate the database to the new location.
Ensure your app has a mechanism for determining which database to use as the master. Just as for the reverse proxy, we used Consul to select it. All backends would switch over when a replica was promoted to master.
Ensure you have a fast method to promote a database replica to a master. You don’t want to be in a situation of having to fiddle with this. We had fully automated scripts to do the failover.
Ensure your app gracefully handles failure of whatever database it thinks is the current master. This is the trickiest bit in some cases, as you either need to make sure updates are idempotent, or you need to make sure updates during the switchover either reliably fail or reliably succeed. In the case I mentioned we were able to safely retry requests, but in many cases it’ll be safer to just punt on true zero-downtime migration, assuming your setup can promote the new master fast enough. In our case the promotion of the new Postgres master took literally a couple of seconds, during which any failing updates just translated to some page loads being slow as they retried. If we hadn’t been able to retry, it’d have meant a few seconds of downtime.
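
The retry behaviour described in the last item above can be sketched roughly as follows. This is only an illustration, not the setup from the comment: `run_with_retry`, `MasterUnavailable` and the `resolve_master` callback are hypothetical names, and the lookup behind `resolve_master` (for example a DNS name kept up to date by Consul) is assumed rather than shown. The point is that the master is looked up again before every attempt, so a retry lands on the newly promoted database. It is only safe for idempotent operations.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class MasterUnavailable(Exception):
    """Hypothetical error raised by the database layer when the master is unreachable."""


def run_with_retry(
    operation: Callable[[str], T],
    resolve_master: Callable[[], str],
    attempts: int = 5,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run an idempotent operation, re-resolving the master before each attempt."""
    last_error = None
    for attempt in range(attempts):
        # Assumption: resolve_master returns the address of the current master,
        # and starts returning the new one once the replica has been promoted.
        master = resolve_master()
        try:
            return operation(master)
        except MasterUnavailable as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_seconds)  # give the promotion a moment to complete
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

With a promotion that takes a couple of seconds, a handful of attempts with a short delay is enough for the request to show up as a slow page load rather than an error.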

Once you have the new environment running and capable of handling requests (but using the database in the old environment):

Reduce DNS record TTL.
Ensure the new backends are added to the reverse proxy. You should start seeing requests flow through the new backends and can verify error rates aren’t increasing. This should be quick to undo if you see errors.
Update DNS to add the new environment reverse proxy. You should start seeing requests hit the new reverse proxy, and some of them should flow through the new backends. Wait to see if any issues appear.
Promote the replica in the new location to master and verify everything still works. Ensure whatever replication you need from the new master works. You should now see all database requests hitting the new master.
Drain connections from the old backends (remove them from the pool, but leave them running until they’re not handling any requests). You should now have all traffic past the reverse proxy going via the new environment.
Update DNS to remove the old environment reverse proxy. Wait for all traffic to stop hitting the old reverse proxy.
When you’re confident everything is fine, you can disable the old environment and bring DNS TTL back up.
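
The draining step in the list above (remove a backend from the pool, but leave it running until it is idle) might look something like this as a script. It is a sketch under assumptions: `drain_backend` is a made-up helper, and the two callbacks stand in for whatever your reverse proxy actually offers for removing a server and reporting its in-flight requests.

```python
import time
from typing import Callable


def drain_backend(
    remove_from_pool: Callable[[], None],
    active_requests: Callable[[], int],
    timeout_seconds: float = 300.0,
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Stop routing new requests to a backend, then wait for in-flight ones to finish.

    Returns True once the backend is idle, or False if the timeout expires first.
    """
    # Hypothetical hook: tell the reverse proxy to stop sending new requests.
    remove_from_pool()
    deadline = clock() + timeout_seconds
    # Hypothetical hook: number of requests the backend is still handling.
    while active_requests() > 0:
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
    return True
```

Only shut the old backend down once this returns True; a False result means something is still holding connections open and is worth looking at before going further.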

The precise sequencing is very much a question of preference - the point is you’re just switching over and testing change by change, and through most of them you can go a step back without too much trouble. I tend to prefer making the changes that are low effort to reverse first. Keep in mind that some changes (like DNS) can take some time to propagate.
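
One way to picture "change by change, with a step back available" is a small runner where every step carries its own check and its own undo. This is a minimal sketch, not the tooling from the comment: `Step` and `run_migration` are hypothetical, and the callbacks would wrap whatever you really use to change the proxy pool, DNS or the database.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    apply: Callable[[], None]     # make the change
    verify: Callable[[], bool]    # e.g. check that error rates have not increased
    rollback: Callable[[], None]  # undo the change


def run_migration(steps: List[Step]) -> List[str]:
    """Apply steps in order, stopping at the first one that fails verification.

    The failing step is rolled back; earlier steps stay applied.
    Returns the names of the steps that remain in place.
    """
    applied = []
    for step in steps:
        step.apply()
        if not step.verify():
            step.rollback()
            break
        applied.append(step.name)
    return applied
```

For slow-propagating changes such as DNS, the `verify` callback would need to wait long enough for the change to take effect before judging it.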

EDIT: You’ll note most of this is basically to treat both sites as one large environment using a VPN to tie them together and ensure you have proper high availability. Once you do, the rest of the migration is basically just failing over.