[PREMIUM] PostgreSQL Sev-1 Prevention Checklist
Slow query checklist tells you what’s broken right now. This one tells you what will break next week before your users find out first.
PostgreSQL Sev-1 Prevention Checklist
Table of Contents
Phase 1 Replication Slot Health (5 min)
Phase 2 Transaction ID Wraparound Risk (3 min)
Phase 3 Autovacuum Winning vs Running (10 min)
Phase 4 Lock Risk Audit (5 min)
Phase 5 Connection Layer Health (5 min)
Phase 6 WAL and Checkpoint Configuration (5 min)
Phase 7 Index Health and Bloat (10 min)
Phase 8 Schema Risk Unindexed Foreign Keys (5 min)
Phase 9 Security and Access Audit (5 min)
Phase 10 Monitoring Coverage Gaps (5 min)
Share this with one engineer on your team who owns a production PostgreSQL instance.
The Sev-1 Database Real incidents. Root causes. The queries I actually run. No theory.
Performance troubleshooting checklist I published in February tells you what is broken right now. This one is different.
This is the checklist I run on a PostgreSQL instance before it breaks. The checks that catch replication slots hours before the disk fills. The checks that catch wraparound risk weeks before PostgreSQL stops accepting writes.
Most of these take under 60 seconds each. All of them have found real problems on production instances I’ve inherited or audited.
Run every phase.
Phase 1 Replication Slot Health (5 minutes)
Replication slots are the most common source of surprise disk-full incidents I’ve handled. An inactive slot retains WAL indefinitely with no cap by default. The disk fills over days, not hours. By the time the alert fires you have minutes.
Check 1.1 Slot inventory and retained WAL size
SELECT
slot_name,
slot_type,
database,
active,
active_pid,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
) AS unconfirmed_wal,
CASE
WHEN active = false
THEN 'INACTIVE — retaining WAL indefinitely'
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10737418240
THEN 'CRITICAL — >10GB retained'
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824
THEN 'WARNING — >1GB retained'
ELSE 'OK'
END AS status
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
Flag: Any slot where active = false investigate immediately. Find the slot owner. Either restart the consumer or drop the slot.
Flag: retained_wal in gigabytes on any slot disk fill is a function of your write rate and time, not a fixed threshold.
All slots active and retained WAL under 500MB healthy.
Check 1.2 WAL retention cap
SELECT name, setting, unit
FROM pg_settings
WHERE name = 'max_slot_wal_keep_size';
Flag: setting = -1 no cap. A single inactive slot can fill your entire disk.
Set this immediately:
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB'; then SELECT pg_reload_conf();Specify the maximum size of WAL files that replication slots are allowed to retain in the pg_wal directory at checkpoint time. If max_slot_wal_keep_size is -1 (the default), replication slots may retain an unlimited amount of WAL files.
Source: PostgreSQL: Documentation: 18: 19.6. Replication
Check 1.3 Physical standby lag
SELECT
application_name,
client_addr,
state,
pg_size_pretty(
pg_wal_lsn_diff(sent_lsn, replay_lsn)
) AS total_lag,
pg_size_pretty(
pg_wal_lsn_diff(write_lsn, flush_lsn)
) AS disk_write_lag,
pg_size_pretty(
pg_wal_lsn_diff(flush_lsn, replay_lsn)
) AS apply_lag,
sync_state
FROM pg_stat_replication
ORDER BY pg_wal_lsn_diff(sent_lsn, replay_lsn) DESC;
Flag: total_lag above 100MB on a synchronous standby your primary commits are waiting for a slow standby.
Flag: No rows returned when you expect standbys standby has disconnected.



