PostgreSQL Point-in-Time Recovery
Introduction
Database disasters happen—whether from hardware failures, human errors, or software bugs. As your applications grow in complexity and importance, so does the value of your data. PostgreSQL Point-in-Time Recovery (PITR) is an advanced feature that allows you to restore your database to any specific moment in time, rather than just to the state of your last full backup.
In this guide, we'll explore how PITR works in PostgreSQL, why it's crucial for production environments, and how to implement it step by step. By the end, you'll understand how to protect your data against various failure scenarios and recover with minimal data loss.
What is Point-in-Time Recovery?
Point-in-Time Recovery combines base backups with continuous archiving of Write-Ahead Log (WAL) files to enable restoration of a PostgreSQL database to any point in time since the base backup was taken.
The system works through two key components:
- Base Backup: A snapshot of the database files at a specific moment
- WAL Archiving: Continuous recording of all changes made to the database after the base backup
When implementing PITR, PostgreSQL records every transaction in WAL files, which can later be replayed to recreate the exact state of the database at any desired point.
Why Use Point-in-Time Recovery?
PITR provides several critical benefits:
- Precise Recovery: Restore to any specific second in time, not just to your last backup
- Minimal Data Loss: Recover right up to the moment before a disaster occurred
- Protection Against Various Failures:
- Accidental data deletion or corruption
- Failed application deployments
- Human errors (like dropping the wrong table)
- Hardware failures
Setting Up PITR in PostgreSQL
Let's walk through setting up PITR in your PostgreSQL environment.
Prerequisites
- PostgreSQL server (9.6 or newer recommended)
- Sufficient disk space for WAL archives
- Administrative access to the database server
Step 1: Configure PostgreSQL for WAL Archiving
First, we need to modify PostgreSQL's configuration to enable WAL archiving. Edit your postgresql.conf
file:
# Enable WAL archiving
wal_level = replica # or 'logical' for PostgreSQL 10+
archive_mode = on # enables archiving
archive_command = 'cp %p /path/to/archive/%f' # command to archive a WAL file
archive_timeout = 60 # maximum time between WAL switches
The archive_command
is a shell command that copies WAL files to your archive location. This can be a local directory, a network share, or a cloud storage solution.
Step 2: Set Up a Base Backup
Next, create a base backup that will serve as the starting point for recovery:
pg_basebackup -D /path/to/backup -Ft -z -P
This command creates a compressed tar file of your PostgreSQL data directory. The options mean:
-D
: Destination directory-Ft
: Output in tar format-z
: Compress with gzip-P
: Show progress
Step 3: Create a Recovery Configuration
For PostgreSQL 11 and earlier, create a recovery.conf
file in your data directory. For PostgreSQL 12 and later, create a postgresql.auto.conf
or use the standard postgresql.conf
file, and create a recovery.signal
file to indicate recovery mode.
For PostgreSQL 12+:
# In postgresql.conf or postgresql.auto.conf
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_time = '2023-04-15 14:30:00'
For PostgreSQL 11 and earlier:
# In recovery.conf
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_time = '2023-04-15 14:30:00'
The restore_command
tells PostgreSQL how to fetch archived WAL files during recovery.
Step 4: Perform the Recovery
To perform a recovery:
-
Stop the PostgreSQL server if it's running:
bashpg_ctl stop -D /path/to/data
-
If needed, clear or move the existing data directory and extract your base backup:
bashrm -rf /path/to/data/*
tar -xzf /path/to/backup/base.tar.gz -C /path/to/data -
For PostgreSQL 12+, create the recovery signal file:
bashtouch /path/to/data/recovery.signal
-
Start the server to begin recovery:
bashpg_ctl start -D /path/to/data
PostgreSQL will begin recovering by first restoring the base backup and then applying WAL files up to the specified recovery target time.
Real-World Example: Recovering from Accidental Table Deletion
Let's walk through a complete example of using PITR to recover from an accidental table deletion.
Scenario
- You have a production database with continuous archiving enabled
- A developer accidentally runs
DROP TABLE important_customers;
at 2:45 PM - You need to recover the database to 2:44 PM, just before the mistake
Solution
-
Stop the database server:
bashsudo systemctl stop postgresql
-
Move or rename the current data directory:
bashmv /var/lib/postgresql/14/main /var/lib/postgresql/14/main_old
-
Restore the most recent base backup:
bashmkdir /var/lib/postgresql/14/main
tar -xzf /backup/base_backup_latest.tar.gz -C /var/lib/postgresql/14/main
chown -R postgres:postgres /var/lib/postgresql/14/main -
Create recovery configuration (PostgreSQL 14):
bash# Edit postgresql.auto.conf
echo "restore_command = 'cp /path/to/archive/%f %p'" >> /var/lib/postgresql/14/main/postgresql.auto.conf
echo "recovery_target_time = '2023-04-15 14:44:00'" >> /var/lib/postgresql/14/main/postgresql.auto.conf
echo "recovery_target_action = 'promote'" >> /var/lib/postgresql/14/main/postgresql.auto.conf
# Create recovery signal file
touch /var/lib/postgresql/14/main/recovery.signal
chown postgres:postgres /var/lib/postgresql/14/main/recovery.signal -
Start the server to begin recovery:
bashsudo systemctl start postgresql
-
Monitor the recovery process:
bashsudo -u postgres psql -c "SELECT pg_is_in_recovery();"
When this returns
f
(false), the recovery is complete. -
Verify that the
important_customers
table exists and has the correct data:bashsudo -u postgres psql -d production -c "SELECT count(*) FROM important_customers;"
Best Practices for PITR
To ensure successful recovery when needed:
-
Test Recovery Regularly: Simulate disasters and practice recovery to verify your process works
-
Monitor Archiving: Ensure WAL files are being archived correctly
bash# Check for archive failures
grep -i "archive command failed" /var/log/postgresql/postgresql.log -
Secure Your Archives: Protect WAL archives with appropriate permissions and consider encryption
-
Maintain Adequate Storage: Monitor archive storage and ensure sufficient space
-
Document Recovery Procedures: Create clear, step-by-step recovery documentation for your team
Common Issues and Troubleshooting
Archive Command Failures
If your archive command fails, WAL files won't be archived properly. Check:
# Verify archive command works manually
cp postgresql_wal_file /path/to/archive/postgresql_wal_file
Ensure the destination directory exists and has proper permissions.
Missing WAL Files During Recovery
If you see errors about missing WAL files during recovery:
FATAL: could not find file "00000001000000A3000000FF": No such file or directory
This usually means there's a gap in your WAL archive. You might need to:
- Use an older base backup
- Adjust your recovery target time to before the gap
- Restore missing WAL files from another source if available
Performance Considerations
Archiving WAL files can impact performance. Consider:
- Using a dedicated disk for WAL files
- Tuning
archive_timeout
based on your recovery point objective (RPO) - Using compression to reduce storage requirements
Summary
PostgreSQL Point-in-Time Recovery is a powerful feature that enables precise database restoration, minimizing data loss in disaster scenarios. By combining base backups with continuous WAL archiving, you can recover your database to any specific moment in time.
We've covered:
- The concepts behind PITR
- Setting up WAL archiving
- Creating base backups
- Configuring and performing recovery
- A real-world recovery example
- Best practices and troubleshooting
Implementing PITR should be a standard practice for any production PostgreSQL database where data loss is unacceptable.
Additional Resources
- PostgreSQL Official Documentation on Continuous Archiving
- PostgreSQL Wiki on Point-in-Time Recovery
- Backup and Recovery Community Guide
Exercises
- Set up WAL archiving on a test PostgreSQL instance
- Create a base backup of your test database
- Make some changes to your database, then deliberately "crash" it
- Restore the database to a specific point in time before the crash
- Create a shell script to automate the backup and archiving process
- Experiment with different
archive_command
options, including sending WAL files to remote storage
Remember that mastering Point-in-Time Recovery is an essential skill for database administrators and developers working with mission-critical data.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)