Achieving High Availability with Docker Swarm, PostgreSQL and Django

Achieving High Availability with Docker Swarm, PostgreSQL and Django

The following methods working together deliver a High Availability solution for Django by maintaining state with PostgreSQL operating under Docker Swarm.

Postgresql

PostgreSQL is a widely used relational database that supports ACID transactions. The acronym ACID stands for atomicity, consistency, isolation, and durability. These are four key properties of database transactions that PostgreSQL supports to ensure the persistence and validity of data in the database.

One method PostgreSQL uses to maintain ACID properties is Write-Ahead Logging (WAL). PostgreSQL first records any transaction on the database to the WAL log files before it writes the changes to the database cluster’s data files.

Postgresql Continuous Archiving

With continuous archiving, the WAL files are copied to secondary storage, which has a couple of benefits. For example, a secondary database cluster can use the archived WAL file for replication purposes, but you can also use the files to perform point-in-time-recovery (PITR). That is, you can use the files to rollback a database cluster to a desirable point if an accident happens.

Postgresql Streaming Replication

Streaming replication is a popular method used to horizontally scale a relational databases. It uses two or more copies of the same database cluster running on separate machines. One database cluster is referred to as the primary and serves both read and write operations; the others, referred to as the replicas, serve only read operations. Streaming replication provides high availability of a system. If the primary database cluster or server were to unexpectedly fail, the replicas are able to continue serving read operations, or (one of the replicas) become the new primary cluster.

Docker Swarm Orchestration

Swarm constantly compares the desired network state against the current state and reconciles the two if necessary. If a primary database node fails, swarm reschedules its tasks into a different node in the same or, another datacenter. When the node recovers, it will become the replica database.

Django

Django doesn’t provide any solution for handling replication lag (i.e., query inconsistencies introduced because of the time taken for a write to propagate to the replicas).

For these reasons Django, will read and write to the Primary Server only.

High Availability Solution

Working together in this manner, as depicted below, the following occurs:

  • Django writes to the Primary PostgreSQL database
  • PostgreSQL continuously archives to the hosts Volume PGARCHIVE
  • PostgreSQL continuously replicates to the hosts Volume PGDATA DC 2
  • PostgreSQL continuously archives to the hosts Volume PGARCHIVE DC 2

In this manner, in normal network operation, there are four copies of the database in two different geographical locations at any one time.

Steady State

Should the database become unreachable, the Primary and Replica Databases are swapped and the method continues as before.

Failover

Implementation

The following postgres docker container entrypoint scripts are run when the container docks. These configure the PG Archiving, Streaming Replication, ensure that users exist, configure network security, and start the required processe at the right time. In this manner, the postgres docker image is extended to become either the primary or replication service. This then facilitates docker swarm to control the postgres cluster state.

Primary Script

#!/bin/bash


if ! test -f "$PGDATA/postgresql.conf"
then
    echo "⚙️  Initialising Database" 
    su -c 'initdb -D /var/lib/postgresql/data/' postgres
fi


echo "▶️  Configuring this postgres to be the primary server"


echo "⚙️  Setting network permissions for app"
if ! grep -q '^host all all 10.0.0.0/8 trust' $PGDATA/pg_hba.conf
then
    echo "host all all 10.0.0.0/8 trust" >> $PGDATA/pg_hba.conf
fi


echo "⚙️  Set network permissions for replicas"
touch $PGDATA/pg_hba.conf
chmod 600 $PGDATA/pg_hba.conf
chown postgres:postgres $PGDATA/pg_hba.conf
if ! grep -q '^host replication all 10.0.0.0/8 trust' $PGDATA/pg_hba.conf
then
    echo "host replication all 10.0.0.0/8 trust" >> $PGDATA/pg_hba.conf
fi


echo "⚙️  Create/verify users and roles"
psql --username postgres <<-EOSQL 2>/dev/null
    CREATE ROLE $PG_REP_USER WITH LOGIN NOSUPERUSER NOCREATEDB NOCREATEROLE NOINHERIT REPLICATION CONNECTION LIMIT -1 PASSWORD '$PG_REP_PASSWORD';
    CREATE DATABASE $POSTGRES_DB;
    CREATE USER $POSTGRES_USER WITH PASSWORD '$POSTGRES_PASSWORD';
    GRANT ALL PRIVILEGES ON DATABASE $POSTGRES_DB TO $POSTGRES_USER;
EOSQL


echo "⚙️  Enable archive mode"
if grep -q '^archive_mode' $PGDATA/postgresql.conf
then
    sed -i '/^archive_mode /s/=.*$/= on/' ${PGDATA}/postgresql.conf
else
    echo "archive_mode = on" >> $PGDATA/postgresql.conf
fi


echo "⚙️  Set archive location"
if grep -q '^archive_command' $PGDATA/postgresql.conf
then
    sed -i '/archive_command/d' $PGDATA/postgresql.conf
fi
echo "archive_command = 'cp %p ${PGARCHIVE}%f'" >> $PGDATA/postgresql.conf


echo "⚙️  Set max_wal_senders to 100"
if grep -q '^max_wal_senders' $PGDATA/postgresql.conf
then
    sed -i '/^max_wal_senders /s/=.*$/= 100/' ${PGDATA}/postgresql.conf
else
    echo "max_wal_senders = 100" >> $PGDATA/postgresql.conf
fi


echo "⚙️  Set wal_level to replica"
if grep -q '^wal_level' $PGDATA/postgresql.conf
then
    sed -i '/^wal_level /s/=.*$/= replica/' ${PGDATA}/postgresql.conf
else
    echo "wal_level = replica" >> $PGDATA/postgresql.conf
fi


echo "⚙️  Reloading..." 
psql --username postgres <<-EOSQL 2>/dev/null
    SELECT pg_reload_conf();
EOSQL


echo "✅ This postgres is the primary server"

Replica Script

#!/bin/bas


if ! test -f "$PGDATA/postgresql.conf"
then
    echo "⚙️  Initialising Database" 
    su -c 'initdb -D /var/lib/postgresql/data/' postgres
fi


echo "▶️  Configuring this postgres to be the replica server"


echo "⚙️  Deleting existing database"
rm -fr $PGDATA/*
rm -fr $PGDATA/.*


echo "⚙️  Wait for primary to be reachable"
until ping -c 1 -W 1 $PG_PRIMARY_HOST; do
    echo "⏳  Waiting for primary to respond..."
    sleep 1s
done


echo "⚙️  Backup database from primary"
echo "*:*:*:$PG_REP_USER:$PG_REP_PASSWORD" > ~/.pgpass
chmod 0600 ~/.pgpass
until pg_basebackup -h $PG_PRIMARY_HOST -D ${PGDATA} -U ${PG_REP_USER} --write-recovery-conf --progress --verbose; do
    echo "⏳  Waiting for primary to connect..."
    sleep 1s
done


echo "⚙️  Setting network permissions for primary"
sed -i '/host replication/d' $PGDATA/pg_hba.conf
echo "host replication all $PG_PRIMARY_HOST md5" >> $PGDATA/pg_hba.conf


echo "⚙️  Enable archive mode"
if grep -q '^archive_mode' $PGDATA/postgresql.conf
then
    sed -i '/^archive_mode /s/=.*$/= always/' ${PGDATA}/postgresql.conf
else
    echo "archive_mode = on" >> $PGDATA/postgresql.conf
fi


echo "⚙️  Set archive location"
if grep -q '^archive_command' $PGDATA/postgresql.conf
then
    sed -i '/archive_command/d' $PGDATA/postgresql.conf
fi
echo "archive_command = 'cp %p ${PGARCHIVE}%f'" >> $PGDATA/postgresql.conf


echo "⚙️  Reloading..." 
psql --username postgres <<-EOSQL 2>/dev/null
    SELECT pg_reload_conf();
EOSQL


echo "✅ This postgres is the replica server"