[PREMIUM] PostgreSQL WAL Troubleshooting: 9 Diagnostic Queries
Diagnostic Queries for WAL Pressure, Replication Slot Lag, and Archive Failures
When pg_wal is growing and you don’t know why here’s the exact sequence to run, from triage to root cause. These are ordered the way I actually use them under pressure.
Table of Contents
Incident Pattern
First Wrong Diagnosis: It Must Be a Write Spike
Second Wrong Diagnosis: Archive Must Be Broken
What WAL Pressure Actually Looks Like From Inside
Query 1 — WAL Generation Rate (Run First, Every Time)
Query 2 — Replication Slot Lag with Health Status
Query 3 — WAL Directory Disk Usage
Query 4 — Streaming Replication Lag Per Standby
Query 5 — WAL Archive Status and Failures
Query 6 — Long-Running Transactions Blocking WAL Cleanup
Query 7 — Tables Generating the Most WAL Right Now
Query 8 — Full WAL Configuration Audit
Query 9 — One-Shot RCA Snapshot for the Incident Ticket
Emergency Decision: When to Drop an Inactive Slot
Prevention Checklist
Key Takeaway
📨 If this saved you time, share it with one engineer on your team who manages production databases.
Before I share you the queries, I want to walk through how WAL incidents actually present because the first two diagnoses are almost always wrong, and knowing why saves you 20 minutes on a live call.
Incident Pattern
Disk alert fires.
The pg_wal directory is growing. Or a standby just disconnected with ERROR: requested WAL segment has already been removed. Or a CDC pipeline is falling behind and nobody knows why.
These are three different symptoms. They can all have the same root cause or three completely different ones. Queries below are designed to rule out causes in sequence, not to throw everything at the wall simultaneously.
Three most common WAL incidents I’ve handled:
Inactive replication slot accumulating WAL indefinitely the consumer disappeared, the slot remained, WAL retention has no cap
WAL archive failure archive_command is failing silently, PostgreSQL is forced to retain WAL locally, disk fills over hours or days
Explosive write spike generating WAL faster than max_wal_size forced checkpoints every 2 minutes, I/O storm, secondary performance impact
Problem is they all present as disk growing or WAL too large. You have to eliminate before you diagnose.
First Wrong Diagnosis: It Must Be a Write Spike
This is the first instinct. Something changed in the application. A batch job ran. Traffic spiked. Somebody ran a data migration.
Problem with this hypothesis: write spikes generate WAL, but WAL self-limits if checkpoints and archiving are working correctly. A write spike doesn’t cause indefinite disk growth it causes a temporary spike that resolves when the spike ends and WAL is archived or recycled.
Indefinite growth means either: the slot is holding WAL hostage, or the archive is failing and PostgreSQL can’t recycle. A write spike is rarely the root cause of a disk-full incident. It’s rarely the root cause of sustained WAL growth.
Run Query 1 and Query 2 before you go down the write spike path. If the generation rate is normal but the disk is still growing the write spike hypothesis is wrong.
Second Wrong Diagnosis: Archive Must Be Broken
Reasonable second guess. If WAL isn’t being archived, PostgreSQL retains it locally. But archive failures show up in the PostgreSQL logs as
LOG: archive command failed. If you’re not seeing those logs archive is probably not the problem.
More importantly: if you have a logical replication slot with an inactive consumer, that slot holds WAL regardless of whether archiving is working. Even perfect archiving doesn’t help you PostgreSQL still can’t recycle the WAL that the slot needs.
Run Query 2 before spending time on archive configuration.
If any slot shows active = false with gigabytes of retained WAL that’s your cause.
What WAL Pressure Actually Looks Like From Inside
When I’m on a live incident and WAL is the suspect, here’s what I see in pg_stat_activity:
wait_event = WALWriterMain WAL writer is the bottleneck, write throughput is saturated
wait_event = WALWrite sessions waiting for WAL to be written before they can proceed
wait_event = DataFileRead on the pg_wal directory recovery or archive process reading WAL
None of these tell you why. They confirm WAL is involved. The queries below find the why.
🚀 PostgreSQL Health Report — One SQL file. 60+ diagnostic checks. Detects vacuum lag, bloat, lock chains, replication slot danger, unindexed FK columns, and more. Used by DBAs managing production at scale.
Query 1 — WAL Generation Rate (Run First, Every Time)
The first question: is WAL being generated faster than normal right now, or is this accumulated WAL from something that stopped consuming it?
-- Run once. Note the LSN. Run again 60 seconds later. Subtract.
SELECT
pg_current_wal_lsn() AS current_lsn,
pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')
) AS total_wal_generated,
now() AS checked_at;
Run this twice, 60 seconds apart. Subtract the LSN values: pg_wal_lsn_diff(lsn_second, lsn_first). That gives you bytes generated per minute. Multiply by 60 for hourly rate.
Interpreting the result:
Generation rate is high and climbing → write spike, check what’s running in pg_stat_activity
Generation rate is normal but disk is growing → something is retaining WAL that should be recycled → go to Query 2
Generation rate is near zero but disk is growing → archive is failing and WAL is accumulating → go to Query 5
Query 2 — Replication Slot Lag with Health Status
This is the query that finds the most common cause of runaway WAL disk usage. An inactive replication slot retains all WAL generated since its last confirmed position indefinitely, with no cap unless max_slot_wal_keep_size is set.
SELECT
slot_name,
slot_type,
database,
active,
active_pid,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal_size,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
) AS unconfirmed_wal,
CASE
WHEN active = false
THEN 'INACTIVE — WAL being retained indefinitely'
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10737418240
THEN 'CRITICAL — >10GB retained'
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824
THEN 'WARNING — >1GB retained'
ELSE 'OK'
END AS health_status
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
What to look for: Any row where active = false is retaining WAL right now. If retained_wal_size is in gigabytes that slot is your disk growth source. The slot’s consumer has gone away. The slot remains. PostgreSQL is faithfully holding every WAL segment since the consumer last confirmed.
If you see an inactive slot with gigabytes retained go directly to the Emergency section at the bottom of this post before continuing with the other queries. The rest of the diagnostic can wait. The disk cannot.



