PostgreSQL Monitoring Strategies
Introduction
Effective database monitoring is crucial for maintaining healthy PostgreSQL deployments. Without proper monitoring, performance issues can go undetected until they escalate into serious problems affecting your applications. This guide will introduce you to essential PostgreSQL monitoring strategies that help you identify bottlenecks, predict potential issues, and ensure optimal database performance.
Whether you're managing a small development database or a large production cluster, implementing these monitoring practices will give you valuable insights into your database's health and performance characteristics.
Why Monitor PostgreSQL?
Monitoring your PostgreSQL databases provides several critical benefits:
- Proactive issue detection - Identify and address problems before they affect users
- Performance optimization - Discover bottlenecks and optimization opportunities
- Capacity planning - Track resource usage trends to plan for future growth
- Incident response - Quickly diagnose and resolve issues when they occur
- SLA compliance - Ensure your database meets performance requirements
Essential PostgreSQL Metrics to Monitor
Let's explore the key metrics you should track when monitoring PostgreSQL databases.
Connection Statistics
Connections are a fundamental resource in PostgreSQL. Each connection consumes server memory, and too many concurrent connections can degrade performance or exhaust server resources.
-- Check current connections
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;
-- Example output:
-- count | state
-- -------+---------------------
-- 23 | active
-- 5 | idle
-- 2 | idle in transaction
Key connection metrics to monitor:
- Current connection count vs. max_connections setting (see the example after this list)
- Number of idle connections
- Long-running idle transactions
- Connection establishment rate
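For the first of these, a minimal sketch that compares current usage against the configured limit:
-- Connection usage as a percentage of max_connections
-- (counts client backends only; background workers are excluded)
SELECT
  count(*) AS current_connections,
  current_setting('max_connections')::int AS max_connections,
  round(100.0 * count(*) / current_setting('max_connections')::int, 2) AS pct_used
FROM pg_stat_activity
WHERE backend_type = 'client backend';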
Database and Table Size
Track the growth of your databases and tables to plan for storage needs and identify unexpected growth patterns.
-- Database sizes
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- Example output:
-- datname | size
-- ------------+---------
-- production | 42 GB
-- analytics | 28 GB
-- testing | 2048 MB
-- Largest tables
SELECT
  schemaname,
  relname,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM
  pg_statio_user_tables
ORDER BY
  pg_total_relation_size(relid) DESC
LIMIT 10;
-- Example output:
-- schemaname | relname | total_size
-- ------------+-----------------+------------
-- public | orders | 12 GB
-- public | users | 6 GB
-- analytics | event_logs | 4 GB
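If a table's total size turns out to be dominated by its indexes, that is worth investigating on its own. A sketch that splits heap and index size for the largest relations:
-- Table vs. index size for the largest relations
SELECT
  relname,
  pg_size_pretty(pg_table_size(relid)) AS table_size,
  pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;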
Query Performance
Slow queries are often the main culprit for PostgreSQL performance issues. Monitoring query execution times helps identify problematic queries.
-- Enable statistics collection if not already enabled
-- (track_activities and track_counts are on by default;
-- track_io_timing and track_functions are off by default)
ALTER SYSTEM SET track_activities = on;
ALTER SYSTEM SET track_counts = on;
ALTER SYSTEM SET track_io_timing = on;
ALTER SYSTEM SET track_functions = 'all';
SELECT pg_reload_conf();
-- Find slow queries (requires the pg_stat_statements extension, set up later in this guide)
SELECT
query,
calls,
round(total_exec_time::numeric, 2) AS total_time,
round(mean_exec_time::numeric, 2) AS mean_time,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS percentage
FROM
pg_stat_statements
ORDER BY
total_exec_time DESC
LIMIT 5;
-- Example output:
-- query | calls | total_time | mean_time | percentage
-- ----------------------------------------+-------+------------+-----------+------------
-- SELECT * FROM orders WHERE user_id = $1| 5243 | 12385.25 | 2.36 | 32.55
-- INSERT INTO events VALUES ($1, $2, $3) | 98541 | 9871.37 | 0.10 | 25.94
-- UPDATE users SET last_seen = $1 WH... | 8762 | 4532.16 | 0.52 | 11.91
Cache Hit Ratio
PostgreSQL relies heavily on caching for performance. A low cache hit ratio might indicate insufficient shared_buffers or inefficient queries.
-- Index and table cache hit ratios
SELECT
  relname,
  round(100.0 * idx_blks_hit / NULLIF(idx_blks_hit + idx_blks_read, 0), 2) AS idx_hit_ratio,
  round(100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0), 2) AS heap_hit_ratio
FROM
  pg_statio_user_tables
WHERE
  idx_blks_hit + idx_blks_read > 0
ORDER BY
  heap_blks_hit + heap_blks_read DESC
LIMIT 10;
-- Example output:
-- relname | idx_hit_ratio | heap_hit_ratio
-- ------------+---------------+----------------
-- users | 98.76 | 99.12
-- orders | 95.32 | 97.85
-- products | 99.01 | 99.54
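The per-table view above can be noisy. For alerting purposes, a single database-wide ratio from pg_stat_database is often a better signal:
-- Overall buffer cache hit ratio for the current database
SELECT
  round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();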
Transaction Rate and VACUUM Activity
Tracking transaction rates helps understand database workload patterns, while monitoring VACUUM activity ensures your database is properly maintained.
-- Transaction rate
SELECT
datname,
xact_commit + xact_rollback AS total_transactions,
xact_commit,
xact_rollback,
round(100.0 * xact_commit / NULLIF(xact_commit + xact_rollback, 0), 2) AS commit_ratio
FROM
pg_stat_database
WHERE
datname NOT LIKE 'template%'
AND datname != 'postgres';
-- Example output:
-- datname | total_transactions | xact_commit | xact_rollback | commit_ratio
-- ------------+--------------------+-------------+---------------+--------------
-- app_db | 842651 | 835284 | 7367 | 99.13
-- analytics | 125489 | 125102 | 387 | 99.69
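For the VACUUM side, pg_stat_user_tables records when each table was last vacuumed and how many dead tuples it currently carries. For example:
-- Recent VACUUM activity and dead-tuple counts
SELECT
  schemaname,
  relname,
  last_vacuum,
  last_autovacuum,
  n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;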
Replication Lag
If you're using PostgreSQL replication, monitoring replication lag is essential to ensure your replicas have up-to-date data.
-- On primary server
SELECT
client_addr,
application_name,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM
pg_stat_replication;
-- Example output:
-- client_addr | application_name | state | sent_lag_bytes | write_lag_bytes | flush_lag_bytes | replay_lag_bytes
-- --------------+------------------+---------+----------------+-----------------+-----------------+------------------
-- 10.0.0.15   | replica1         | streaming | 1024           | 1024            | 2048            | 4096
-- 10.0.0.16   | replica2         | streaming | 2048           | 2048            | 4096            | 8192
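On the replica side, lag can also be expressed as wall-clock time rather than bytes. One common approach (the result is only meaningful while the replica is actively replaying WAL):
-- On a replica: how far behind the primary in time terms
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;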
Setting Up Monitoring Tools
Let's look at some practical ways to implement PostgreSQL monitoring.
Built-in Statistics Collector
PostgreSQL includes a statistics collector that gathers information about server activity. Enable these settings in your postgresql.conf:
# Statistics monitoring
track_activities = on
track_counts = on
track_io_timing = on
track_functions = all
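To confirm what is currently in effect, you can query pg_settings:
-- Verify the statistics settings currently in effect
SELECT name, setting
FROM pg_settings
WHERE name IN ('track_activities', 'track_counts', 'track_io_timing', 'track_functions');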
pg_stat_statements Extension
The pg_stat_statements extension provides detailed query performance statistics:
-- First, add the library to postgresql.conf (this requires a server restart):
-- shared_preload_libraries = 'pg_stat_statements'
-- Optional tuning, also in postgresql.conf:
-- pg_stat_statements.max = 10000
-- pg_stat_statements.track = all
-- Then enable the extension in each database you want to monitor:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
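After the restart, a quick sanity check confirms the view is populated, and pg_stat_statements_reset() clears accumulated statistics if you want a fresh baseline:
-- Sanity check: should return without error once the library is loaded
SELECT count(*) FROM pg_stat_statements;
-- Optional: reset statistics to start a fresh measurement window
SELECT pg_stat_statements_reset();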
Monitoring Workflow with External Tools
While PostgreSQL's internal statistics are valuable, a comprehensive monitoring solution typically involves external tools that collect, store, and visualize these metrics over time.
Popular monitoring solutions include:
- Prometheus + Grafana: Open-source monitoring stack
- postgres_exporter: Prometheus exporter for PostgreSQL (used in the walkthrough below)
- pgMonitor: Crunchy Data's open-source PostgreSQL monitoring suite
- Datadog, New Relic, Dynatrace: Commercial APM solutions with PostgreSQL integrations
Setting Up Basic Prometheus Monitoring
Here's a quick example of how to set up PostgreSQL monitoring with Prometheus and the postgres_exporter:
- Install postgres_exporter:
# Download postgres_exporter
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.10.0/postgres_exporter-0.10.0.linux-amd64.tar.gz
# Extract and install
tar xvfz postgres_exporter-*.tar.gz
cd postgres_exporter-*
sudo cp postgres_exporter /usr/local/bin/
- Create a configuration file:
# Create postgres_exporter.env file
echo "DATA_SOURCE_NAME=\"postgresql://postgres:password@localhost:5432/postgres?sslmode=disable\"" > postgres_exporter.env
- Run the exporter:
# Export the connection string, then start the exporter
# (postgres_exporter reads DATA_SOURCE_NAME from the environment)
set -a; source postgres_exporter.env; set +a
postgres_exporter
- Add to your Prometheus configuration:
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
Alerting Strategies
Monitoring is only valuable if it helps you respond to issues. Setting up alerts for critical conditions is essential:
Common Alert Thresholds
Metric | Warning Threshold | Critical Threshold
---|---|---
Connection usage | > 70% of max_connections | > 90% of max_connections
Replication lag | > 10 MB or 1 minute | > 100 MB or 5 minutes
Disk usage | > 80% | > 90%
Cache hit ratio | < 98% | < 95%
Long-running transactions | > 5 minutes | > 15 minutes
Deadlocks | > 0 in 1 hour | > 5 in 1 hour
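Several of these thresholds map directly onto counters you've already seen. Deadlocks, for instance, are tracked cumulatively per database in pg_stat_database:
-- Cumulative deadlock counts per database (since the last stats reset)
SELECT datname, deadlocks
FROM pg_stat_database
WHERE datname IS NOT NULL
  AND datname NOT LIKE 'template%';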
Alert Response Playbook
When alerts trigger, having a response plan is crucial:
- Verify the alert - Check if it's a false positive or a real issue
- Assess impact - Determine if users or applications are affected
- Identify cause - Review logs and metrics to find the root cause
- Take action - Implement a fix or mitigation
- Document - Record the incident and resolution for future reference
Performance Tuning Based on Monitoring Data
Monitoring data should drive your performance tuning efforts. Here are some common optimizations based on monitoring insights:
- High CPU usage: Review and optimize expensive queries, consider query parallelization
- High disk I/O: Increase shared_buffers, optimize indexes, implement partitioning
- Connection pool saturation: Implement connection pooling with PgBouncer or Odyssey
- Low cache hit ratio: Increase shared_buffers, optimize queries to use indexes better
- High WAL generation rate: Adjust checkpoint settings, reduce unnecessary updates
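To gauge WAL generation rate, one simple approach is to sample pg_current_wal_lsn() twice and diff the two positions. A sketch (the LSN literal below is a placeholder for the value you recorded on the first run):
-- First run: note the current WAL position
SELECT pg_current_wal_lsn();
-- Second run, some minutes later: bytes of WAL written since the first sample
-- ('0/3000000' is a placeholder for the LSN recorded earlier)
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000000'::pg_lsn));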
Implementing a Monitoring Strategy
Let's put together a practical monitoring strategy for a typical PostgreSQL deployment:
Daily Monitoring Checklist
-- 1. Check database size growth
SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database;
-- 2. Review slow queries
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
-- 3. Check for long-running transactions
SELECT pid, state, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY duration DESC;
-- 4. Check for table bloat
SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup,
round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_ratio DESC;
-- 5. Review index usage
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
Weekly Maintenance Tasks
- Analyze database performance trends:
  - Compare current week with previous weeks
  - Look for patterns in peak usage times
  - Identify gradually degrading metrics
- Review and tune database parameters based on observed behavior:
  - Adjust work_mem for complex queries
  - Fine-tune shared_buffers based on cache hit ratios
  - Optimize autovacuum settings for tables with high update/delete rates
- Check and optimize top resource-consuming queries:
  - Rewrite inefficient queries
  - Add appropriate indexes (a query for spotting unused indexes follows below)
  - Consider materialized views for complex reporting queries
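For the index review mentioned above, a sketch that flags indexes never scanned since the last statistics reset (verify before dropping anything, since a recent stats reset or a rarely-run workload can mislead):
-- Indexes with zero scans (candidates for review, not automatic removal)
SELECT
  schemaname,
  relname,
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;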
Best Practices for PostgreSQL Monitoring
- Start simple and expand - Begin with basic metrics and add more as you understand your workload
- Focus on trends, not just thresholds - Sudden changes in patterns can indicate issues
- Correlate database metrics with application metrics - Database problems often manifest as application slowness
- Document your baseline - Know what "normal" looks like for your database
- Automate routine checks - Use scripts to perform daily and weekly monitoring tasks
- Test your monitoring - Deliberately trigger conditions to ensure alerts work
- Maintain historical data - Keep at least 30 days of metrics to identify trends
Summary
Effective PostgreSQL monitoring is a cornerstone of database reliability and performance. By tracking key metrics like connection statistics, query performance, cache efficiency, and replication status, you can ensure your databases run optimally and address issues before they impact users.
Remember that monitoring is not a set-it-and-forget-it task—it requires ongoing attention and refinement as your database workload evolves. Start with the basics outlined in this guide, and gradually build a monitoring strategy tailored to your specific needs.
Additional Resources
- PostgreSQL Documentation on Monitoring
- Practical PostgreSQL Monitoring (Official Documentation)
- pg_stat_statements Documentation
Exercises
- Set up pg_stat_statements on your development PostgreSQL instance and identify your top 5 most time-consuming queries.
- Create a simple bash script that collects basic PostgreSQL metrics and emails a daily report.
- Implement a Grafana dashboard using Prometheus and postgres_exporter to visualize your database performance.
- Design an alert strategy for your PostgreSQL database with appropriate thresholds for your workload.
- Analyze a slow query using EXPLAIN ANALYZE and optimize it based on the results.