Databases are notoriously fussy to work with. Postgres is no exception. Though the software itself may be pretty solid, stuff like major version upgrades or migrations to “the cloud” (or back to on-prem) are really tricky to do without significant and costly downtime. Though there’s tools out there to make this process easier, many of these simply don’t work for anything more than small test databases, and will silently corrupt tables or fail spectacularly in real-world scenarios.

This post is about how to safely migrate a real-world Postgres database without downtime using pglogical. As a bonus, this procedure works to migrate an on-premise db to AWS RDS (many tools don’t work with RDS), and you can perform multiple major version upgrades as part of the process (skip as many versions as you want!).

I haven’t written any blog posts for a very long time. Writing these posts is a lot of work - I usually only sit down to write something when it’s of use to me and public documentation doesn’t exist or is otherwise very sparse. This is one of those articles. (Looking to migrate a MySQL / MariaDB database in a similar manner? Check out this guide from AWS.)

pglogical: Why are we using it?⌗

The goal of migrating a database is to create an identical copy of it on a separate piece of infrastructure, either on another VM, another datacenter, or perhaps even another country. There’s a lot of different ways of migrating Postgres databases and unfortunately all of them have significant limitations. Let’s quickly do an overview of the different Postgres tools and demonstrate why pglogical is the least bad option (at the time of writing).

Dump and restore⌗

This is the most basic method of moving a database. You create a database dump from the source database with a tool like pg_dumpall or pg_basebackup and restore it on the target. Obviously, this is not a great option when you want to avoid downtime. Depending on database size, it can take hours to create the initial database backup, and many hours to restore it on the new target instance. Any writes that occur on the source instance after the backup is taken are lost. Though this method is virtually foolproof and can perform upgrades as part of the process, it obviously incurs significant downtime. This isn’t an option for many businesses, and likely everyone involved would prefer it if there was no disruption to the business at all during the migration.

Binary replication / “hot standby” databases⌗

Binary replication is the easiest to setup, and lets you create a read-only replica db from your original master. Out of all the replication options, this is by far and away the best option. It “just works”. If you’re looking to setup binary replication, honestly the best starting point is the official documentation:

Quick tutorial: https://wiki.postgresql.org/wiki/Hot_Standby
More detailed overview: https://www.postgresql.org/docs/current/hot-standby.html

Unfortunately, binary replication has several key disadvantages:

It doesn’t work with most cloud providers. If you want to migrate to a managed database like AWS RDS, you won’t actually have the superuser permissions and tools to set this up.
Binary replication only works with DBs of the same major version. If you want to replicate to a different version, well… you can’t.
Replication is one-way: you can only have a single “master” database active at any given time. (Unpopular opinion: if you want master-master replication that doesn’t suck, you should honestly just switch to MariaDB, where this is a solved problem.

If you are not upgrading to a new database major version and are not trying to migrate to a managed service like AWS RDS, stop reading this article now and just use binary replication. It’s the simplest option and is the fastest path to a successful migration.

Third-party replication tools⌗

There are a lot of other replication tools out there designed to address some of the shortcomings of Postgres’ built-in binary replication. I won’t go too deeply into each one, but whether they work or not is highly dependent on what your database looks like: how big is it, what types of data you have in there, where you’re trying to migrate to (when using managed databases like RDS, you’ll frequently be missing the pemissions necessary to set things up), etc..

While trying things out, we explored trigger-based replication tools like Slony, Bucardo, and Londiste. All of these had significant issues with the databases we tried to migrate. In particular, replication frequently broke, and there were numerous issues with truncated or empty tables where a database that had supposedly been replicated was missing data. As mentioned before, success is highly dependent on the database you’re trying to migrate. Simple databases that don’t use special datatypes or triggers are much more likely to work. It’s possible that one of these tools may work for your DB, but you’ll need to try them out to know for sure.

AWS has a managed “Database Migration Service” (DMS): this is a proprietary AWS tool that live-replicates data from one database to another. It also sucks. In addition to taking an extremely long time to replicate large databases, silent corruption and truncation of tables is extremely common - even if a migration survives the initial copy phase (in my experience, tables with more than a hundred million rows will consistently break a DMS replication instance irreparably, especially if record validation is enabled), many records will be altered in mysterious ways. Some of the more entertaining failures I encountered included DMS shifting some, but not all, timestamps in a MySQL database 3 hours into the future (no, this wasn’t timezone-related), and DMS rounding all of the columns that used Postgres’ money datatype to the nearest ten cents. If you’ve been considering using DMS on any DB you care about… don’t. Your time is better spent investigating other migration options that actually work reliably. DMS is only worthwhile to look into if you need to switch DB technologies completely (such as from Oracle to Postgres).

Postgres logical replication and pglogical⌗

At this point, migrating a Postgres database without downtime of some kind is looking increasingly impossible. Enter pglogical:

pglogical is a “logical replication” tool for Postgres. Instead of replicating filesystem-level changes (binary replication), pglogical replicates the actual database SQL statements from one db to another target db. Executing an insert on the master db will execute the same insert on any dbs “subscribed” to it. Though the tool has its own limitations (we’ll get to those in a second), logical replication has several MAJOR advantages over other methods:

You can set filters and rules for what you want to replicate. Instead of creating an exact binary copy of another DB, you can selectively replicate only parts of it.
Replication works between major versions. Because the actual SQL statements being replicated are not specific to a particular version, you can replicate between different major versions without issue. You can even skip multiple major Postgres versions in one go (For what it’s worth, I’ve successfully done 9.4->9.6 and 9.6->11 without issue, though your results may vary).
Cloud providers like AWS actually support it. AWS RDS has the pglogical extension built into their RDS images on Postgres 9.6.10 and above.
It’s one of the only replication tools that supports really old versions of Postgres (9.4+).
It’s free and open source. Hopefully I don’t need to explain why this is awesome.
It actually works. Though the tool is still pretty tricky to use, pglogical is remarkable for the fact that it hasn’t failed me yet.

But wait, what about Postgres’ native logical replication? Postgres recently gained the ability to perform logical replication on its own in version 10. So how is pglogical different? As it turns out, they’re actually the same tool under the hood. Both Postgres’ native logical replication and pglogical were developed by 2ndQuadrant - pglogical is the upstream for Postgres’ native logical replication, and has significantly more features (esp. for Postgres 10 and 11). pglogical is also notable for working on Postgres 9.4+, whereas native logical replication isn’t supported until versions 10+. To sum things up, the main difference between pglogical and Postgres native replication is that pglogical will have more features on older versions of Postgres (and managed services like RDS support using pglogical).

So what are the downsides?⌗

You still need to apply a database restart to install the pglogical extension. I may have lied earlier when I said that the migration was “zero-downtime”, but a single restart isn’t bad as far as these things go (plus you can choose when to do the restart, as opposed to being forced to do it as part of cutover).
Lots of stuff isn’t supported. pglogical doesn’t migrate sequences very well, if it does it at all. The documentation claims that sequences will be syncronized “periodically”, but in practice, I’ve never seen pglogical actually sync them unless you explictly force it to. There’s more details on sequence migration later in this article, but this is one of the issues where using pglogical has several major caveats.
Changes to database schema don’t get replicated whatsoever. Any changes to database table structure or anything else need to be performed separately on both the master DB and its replicas.
Replication is per-DB. If you have a postgres server with a bunch of DBs on it, you’re going to need to setup and monitor replication for each one individually. This can very quickly turn a migration into a lot of work.
A primary key is required to perform UPDATEs and DELETEs. Tables without primary keys are going to be insert-only. I have no idea why some developers keep creating tables without primary keys, but if you are unfortunate enough to have any of these individuals at your company, make sure they’re aware of this caveat before you start the migration process.
Foreign keys are ignored during replication. If you’re simply moving a database from one location to another, this is probably not a huge concern, but if want to use pglogical in a master-master replication setup, foreign key constraints aren’t going to do anything. If you want good master-master replication, stop reading this article now and switch to MariaDB.
Documentation is really sparse. Here’s the official documentation: https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/. That’s it. There’s a few blog posts out there of questionable veracity (this one included), but as documentation goes, you’re more or less on your own. There are very few how-to’s out there and good luck asking questions on Stack Overflow if you get yourself into trouble. Please do not email me or ask me for help (yes, I don’t care if you have money).

Did you read all that and are still interested in migrating/upgrading a database? Let’s get started.

Migrate a database using pglogical⌗

Before we start, make sure you’ve completed the following pre-requisites:

Verify that both the source and target database are running Postgres 9.4+. If you’re using Postgres 9.4, be aware that there are several special considerations you need to take into account. If you are using RDS, the RDS databases must be Postgres version 9.6.10 or above - that is the earliest RDS version that supports pglogical.
You have a direct network connection between your source and target DB. It doesn’t matter what connection you’ve got as long as it works: AWS VPC peering, IPsec, Wireguard, etc. - in some cases you can even get by with an SSH tunnel. Just make sure the target db instance is able to connect to the source db.
You have superuser or equivalent privileges on both the source and target db. If you are using RDS, the rds_superuser role is sufficient.
You have read, or at least skimmed the official documentation. Don’t skip this step.

Note: if you want to use pglogical to perform a zero-downtime upgrade, setup the target database on whatever Postgres version you wish to upgrade to. If you wanted to upgrade from version 9.6 to version 12, you would install Postgres 12 on the target.

Initial pglogical setup⌗

Before you do anything else, make sure you’ve installed the pglogical package on both dbs (the next few setup steps need to be done for both source and target dbs). Follow the official documentation here: https://www.2ndquadrant.com/en/resources/pglogical/pglogical-installation-instructions/.

Add the following to your Postgres config, and restart the Postgres service to apply the changes (a reload is not sufficient to load the pglogical package).

# for more information on these values, see pglogical docs.
# these values are sufficient if you intend to migrate less than 10 databases from the source instance
wal_level = 'logical'
shared_preload_libraries = 'pglogical'
max_worker_processes = 10 
max_replication_slots = 10
max_wal_senders = 10         # 10 + previous value, if one was there
track_commit_timestamp = on  # leave this line out if using postgres 9.4

CREATE ROLE pglogical WITH LOGIN REPLICATION SUPERUSER PASSWORD 'some_password_here';

Ensure that this user is able to connect from the target instance to the source in the source instance’s pg_hba.conf (please use a secure authentication method). If you don’t know how to do this, see the official documentation for pg_hba.conf. Note that a reload is sufficient to apply changes to pg_hba.conf (restarting postgres is not necessary here).

Initial pglogical setup on RDS⌗

If you are using RDS, you’ll need to add pglogical to shared_preload_libraries in your parameter group and reboot your RDS instance - see the RDS docs on how to make changes to a parameter group if you are unsure. If it already has another value there, just add pglogical to the end (these values can be comma-separated).

Create a “pglogical” user with RDS superuser privileges (the “pglogical” user name is arbitrary):

CREATE ROLE pglogical WITH LOGIN PASSWORD 'some_password_here';
GRANT rds_replication TO pglogical; 
GRANT rds_superuser TO pglogical;

Configure source database⌗

Note: every step that follows is per-database. If the source db instance has multiple datbases you want to migrate, you’ll need to repeat these steps for each database.

At this point, we’re ready to actually setup replication. pglogical has some important terminology that we need to understand before we can continue:

A “node” represents a database. It can either be a publisher (source) or subscriber (target).
A “replication set” is a set of tables and sequences to be migrated, as well as what changes should be replicated (stuff like INSERT, UPDATE, DELETE, and/or TRUNCATE).
A “subscription” represents an actual replication connection. “Subscriber” nodes sync changes from “publisher” nodes. By default, all replication sets are migrated from the source to the target.

The replication process has three basic steps: setup the provider node and select what data to replicate, setup the subscriber node, then create a replication connection. With this in mind, let’s setup the source database now. Login to the database you wish to replicate on the source instance and perform the following:

-- Create the pglogical extension
CREATE EXTENSION pglogical;
CREATE EXTENSION pglogical_origin;  -- only on postgres 9.4, otherwise skip

-- Create the publisher node
-- The DSN represents how to connect to the source database
SELECT pglogical.create_node(
    node_name := "source",
    dsn := 'host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password'
);

Now we need to select which tables should be migrated and add them to a replication set. Several replication sets were created when you ran CREATE EXTENSION pglogical;: default, default_insert_only, and ddl_sql.

The default replication set is what you should use by default and will replicate INSERT, UPDATE, DELETE, and TRUNCATE (note that TRUNCATE CASCADE doesn’t work).
default_insert_only only replicates INSERT statements. You use this for tables that don’t have a primary key.
ddl_sql is a special replication set designed to replicate schema changes. You don’t need it here because this is a one-time migration and will not be making changes to the source instance’s schema during the process. (Using the ddl_sql replication set is outside the scope of this article.)

Tables and sequences needed to be added to the replication sets individually, though there are helper functions to do this in one go:

-- Add all tables to the default replication set from the 'public' schema
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public'])

-- Check which tables have been added to all replication sets
SELECT * FROM pglogical.replication_set_table;

-- Add all sequences to the default replication set from the 'public' schema
SELECT pglogical.replication_set_add_all_sequences('default', ARRAY['public']);

-- Check which sequences have been added to all replication sets
SELECT * FROM pglogical.replication_set_seq;

If a table doesn’t have a primary key, you’ll need to remove it from the default replication set and add it to the default_insert_only replication set. Any tables created by an extension like postgis should also need to be removed from replication (extension-specific tables will be “migrated” when you import the db schema on the target db instance).

-- Remove a table from the 'default' replication set and add it to 'default_insert_only'
-- (for other table manipulations see the official documentation):
SELECT pglogical.replication_set_remove_table('default', 'table_name_here');
SELECT pglogical.replication_set_add_table('default_insert_only', 'table_name_here');

Before you proceed, check your work - note that you can investigate and view the tables that pglogical uses for replication in the pglogical schema: \dt pglogical.* will give you a list of tables you can look at. Do not attempt to manipulate these tables yourself except through the functions that pglogical provides (“Here be dragons.”).

Finally, create a schema-only dump of the source database you wish to migrate with pg_dump:

pg_dump -U pglogical -h source_database -s database_name > database_name_schema.sql

Configure the target database⌗

Create the database you want to migrate on the target:

CREATE DATABASE database_name;

Import the schema from the previous step on the target database. If you encounter errors, feel free to drop the database on the target and reimport the schema as many times as it takes to make sure you’ve fixed any errors.

psql -U pglogical -h target_database -d database_name < database_name_schema.sql

Recreate any users/roles that need to be imported (you can also dump and restore these with pg_dumpall, but this is not covered here).

With all of that done, we can setup the pglogical subscriber node.

-- Create the pglogical extension
CREATE EXTENSION pglogical;
CREATE EXTENSION pglogical_origin;  -- only on postgres 9.4, otherwise skip

-- Create the subscriber node
-- The DSN describes how to connect to the target instance
SELECT pglogical.create_node(
    node_name := 'target',
    dsn := 'host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password'
);

Create the subscription⌗

Now that the source and target database nodes have been setup, we can create the replication subscription. Creating the subscription will immediately start replication, so make sure you’re ready to start before beginning this step:

-- This command is run on the target instance (the "subscriber node")
SELECT pglogical.create_subscription(
    subscription_name := 'subscription_name_here',
    provider_dsn := 'host=hostname_or_ip_address port=5432 dbname=database_to_migrate user=pglogical password=pglogical_password'
);

Now that you’ve created the subscription, check its status:

SELECT pglogical.show_subscription_status('subscription_name_here');

There are three possible “subscription statuses”:

down - This is probably what you’re going to see the first time setting this up. This means that replication has failed. This can either due to connectvity issues or an actual problem with replication. Check the Postgres logs on the source and target databases to see why.
initializing - A status of initializing means that the source database is performing the initial copy of the table data from source to target. Seeing this typically indicates success - you just need to wait until the subscription reaches replicating state.
replicating - This means that the target database node has fully replicated the entire source db, and is now replicating ongoing changes. A “replicating” db is almost fully migrated - there are several final steps you’ll need to take.

If replication is down, see below for troubleshooting instructions. Otherwise, if replication has been successful, feel free to skip the next section.

Troubleshooting replication⌗

If replication is initializing or replicating, skip this section: things are working. Otherwise, continue reading.

Before you do anything else, be aware that you can stop and start replication as needed via:

-- Pause replication
SELECT pglogical.alter_subscription_disable('subscription_name_here');

-- Resume replication where you left off
SELECT pglogical.alter_subscription_enable('subscription_name_here');

The Postgres server logs will always have the cause of replication failures - do what they say to fix things. The most likely cause of replication issues is networking issues. Again, ensure that the target database can connect to the source database on port 5432 and that the replication user can connect to the source database from the target in the source database’s pg_hba.conf.

Other issues may require resynchronizing a particular table. You can forcibly resynchronize a table with the following command (run from the source instance):

-- NOTE: will truncate the table on the target
SELECT pglogical.alter_subscription_resynchronize_table('subscription_name_here', 'table_name');

-- Check the status of the resynchronized table, pretty self-explanatory
SELECT pglogical.show_subscription_table('subscription_name_here', 'table_name');

The above command may not succeed if the table has foreign key constraints - you’ll need to manually TRUNCATE both the table you are attempting to resynchronize as well as any tables that depend on it on the target (obviously, make EXTRA SURE that you’re running the TRUNCATEs on the correct database).

Perhaps you forgot a table when setting up the initial replication sets? To fix this, you can still add the table during replication, then call pglogical.alter_subscription_resynchronize_table().

For anything else, check the official pglogical documentation. Although there’s very few how-tos in there, there’s a number of functions to do most anything you need to. Don’t be afraid to completely restart the process from the beginning if you have to (drop the target database, reimport the schema, and re-setup replication on the target instance from scratch). If you need to peek at pglogical’s current state, remember that you can get a list of tables with \dt pglogical.*.

Completing the migration⌗

If your setup has been successful and reached replicating state (if its still initializing, just wait for the initial copy to complete) you’ll need to take several steps to complete the migration.

At this point, the actual table data has been synced, but the sequences are still out of sync (the pglogical documentation claims they’ll be synced, but in practice, it doesn’t seem to happen). You can synchronize them with this command on the source instance:

-- Before running it, note that this will actually add 1000 to each sequence value on the target.
-- This is actually by design - you can read about and complain about this here:
-- https://github.com/2ndQuadrant/pglogical/issues/163
SELECT pglogical.synchronize_sequence( seqoid ) FROM pglogical.sequence_state;

-- You can check individual sequences with this command on the subscriber database:
SELECT last_value FROM sequence_name;

If you want an alternative command to sync sequences, try this command on the target database: https://wiki.postgresql.org/wiki/Fixing_Sequences

Once the sequences have been synced, you can begin the cutover process to the new database. I cannot help you with this part - the exact cutover steps you need to perform will depend on what application is connected to Postgres. One very important thing to note about the cutover process is that both the source and target are writable. The replication subscription will continue replicating data until you issue the following command:

-- Optional - temporarily stop replication:
SELECT pglogical.alter_subscription_disable('subscription_name_here');

-- Permanently disable replication
-- (re-migrating data will require repeating the migration process from scratch):
SELECT pglogical.drop_subscription('subscription_name_here');

At this point replication is terminated - you can drop the remaining pglogical nodes, extension, and roles at your convenience (they do not impact normal operation as long as there is no active replication subscription). Congratulations on a successful migration!

Near zero-downtime Postgres migrations and upgrades with pglogical