The Sev-1 Database

The Sev-1 Database

[PREMIUM] PostgreSQL WAL Troubleshooting: 9 Diagnostic Queries

Diagnostic Queries for WAL Pressure, Replication Slot Lag, and Archive Failures

Haider Z @ Microsoft's avatar
Haider Z @ Microsoft
Apr 03, 2026
∙ Paid

When pg_wal is growing and you don’t know why here’s the exact sequence to run, from triage to root cause. These are ordered the way I actually use them under pressure.


Table of Contents

  1. Incident Pattern

  2. First Wrong Diagnosis: It Must Be a Write Spike

  3. Second Wrong Diagnosis: Archive Must Be Broken

  4. What WAL Pressure Actually Looks Like From Inside

  5. Query 1 — WAL Generation Rate (Run First, Every Time)

  6. Query 2 — Replication Slot Lag with Health Status

  7. Query 3 — WAL Directory Disk Usage

  8. Query 4 — Streaming Replication Lag Per Standby

  9. Query 5 — WAL Archive Status and Failures

  10. Query 6 — Long-Running Transactions Blocking WAL Cleanup

  11. Query 7 — Tables Generating the Most WAL Right Now

  12. Query 8 — Full WAL Configuration Audit

  13. Query 9 — One-Shot RCA Snapshot for the Incident Ticket

  14. Emergency Decision: When to Drop an Inactive Slot

  15. Prevention Checklist

  16. Key Takeaway

📨 If this saved you time, share it with one engineer on your team who manages production databases.

Share The Sev-1 Database

Before I share you the queries, I want to walk through how WAL incidents actually present because the first two diagnoses are almost always wrong, and knowing why saves you 20 minutes on a live call.


Incident Pattern

Disk alert fires.
The pg_wal directory is growing. Or a standby just disconnected with
ERROR: requested WAL segment has already been removed. Or a CDC pipeline is falling behind and nobody knows why.

These are three different symptoms. They can all have the same root cause or three completely different ones. Queries below are designed to rule out causes in sequence, not to throw everything at the wall simultaneously.

Three most common WAL incidents I’ve handled:

  • Inactive replication slot accumulating WAL indefinitely the consumer disappeared, the slot remained, WAL retention has no cap

  • WAL archive failure archive_command is failing silently, PostgreSQL is forced to retain WAL locally, disk fills over hours or days

  • Explosive write spike generating WAL faster than max_wal_size forced checkpoints every 2 minutes, I/O storm, secondary performance impact

Problem is they all present as disk growing or WAL too large. You have to eliminate before you diagnose.

First Wrong Diagnosis: It Must Be a Write Spike

This is the first instinct. Something changed in the application. A batch job ran. Traffic spiked. Somebody ran a data migration.

Problem with this hypothesis: write spikes generate WAL, but WAL self-limits if checkpoints and archiving are working correctly. A write spike doesn’t cause indefinite disk growth it causes a temporary spike that resolves when the spike ends and WAL is archived or recycled.

Indefinite growth means either: the slot is holding WAL hostage, or the archive is failing and PostgreSQL can’t recycle. A write spike is rarely the root cause of a disk-full incident. It’s rarely the root cause of sustained WAL growth.

Run Query 1 and Query 2 before you go down the write spike path. If the generation rate is normal but the disk is still growing the write spike hypothesis is wrong.

Second Wrong Diagnosis: Archive Must Be Broken

Reasonable second guess. If WAL isn’t being archived, PostgreSQL retains it locally. But archive failures show up in the PostgreSQL logs as

LOG: archive command failed. If you’re not seeing those logs archive is probably not the problem.

More importantly: if you have a logical replication slot with an inactive consumer, that slot holds WAL regardless of whether archiving is working. Even perfect archiving doesn’t help you PostgreSQL still can’t recycle the WAL that the slot needs.

Run Query 2 before spending time on archive configuration.
If any slot shows active = false with gigabytes of retained WAL that’s your cause.

What WAL Pressure Actually Looks Like From Inside

When I’m on a live incident and WAL is the suspect, here’s what I see in pg_stat_activity:

  • wait_event = WALWriterMain WAL writer is the bottleneck, write throughput is saturated

  • wait_event = WALWrite sessions waiting for WAL to be written before they can proceed

  • wait_event = DataFileRead on the pg_wal directory recovery or archive process reading WAL

None of these tell you why. They confirm WAL is involved. The queries below find the why.


🚀 PostgreSQL Health Report — One SQL file. 60+ diagnostic checks. Detects vacuum lag, bloat, lock chains, replication slot danger, unindexed FK columns, and more. Used by DBAs managing production at scale.

Get it here → $29


Query 1 — WAL Generation Rate (Run First, Every Time)

The first question: is WAL being generated faster than normal right now, or is this accumulated WAL from something that stopped consuming it?

-- Run once. Note the LSN. Run again 60 seconds later. Subtract.
SELECT
  pg_current_wal_lsn()                           AS current_lsn,
  pg_walfile_name(pg_current_wal_lsn())           AS current_wal_file,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')
  )                                               AS total_wal_generated,
  now()                                           AS checked_at;

Run this twice, 60 seconds apart. Subtract the LSN values: pg_wal_lsn_diff(lsn_second, lsn_first). That gives you bytes generated per minute. Multiply by 60 for hourly rate.

Interpreting the result:

  • Generation rate is high and climbing → write spike, check what’s running in pg_stat_activity

  • Generation rate is normal but disk is growing → something is retaining WAL that should be recycled → go to Query 2

  • Generation rate is near zero but disk is growing → archive is failing and WAL is accumulating → go to Query 5

Query 2 — Replication Slot Lag with Health Status

This is the query that finds the most common cause of runaway WAL disk usage. An inactive replication slot retains all WAL generated since its last confirmed position indefinitely, with no cap unless max_slot_wal_keep_size is set.

SELECT
  slot_name,
  slot_type,
  database,
  active,
  active_pid,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal_size,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  ) AS unconfirmed_wal,
  CASE
    WHEN active = false
    THEN 'INACTIVE — WAL being retained indefinitely'
    WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10737418240
    THEN 'CRITICAL — >10GB retained'
    WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824
    THEN 'WARNING — >1GB retained'
    ELSE 'OK'
  END AS health_status
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

What to look for: Any row where active = false is retaining WAL right now. If retained_wal_size is in gigabytes that slot is your disk growth source. The slot’s consumer has gone away. The slot remains. PostgreSQL is faithfully holding every WAL segment since the consumer last confirmed.

If you see an inactive slot with gigabytes retained go directly to the Emergency section at the bottom of this post before continuing with the other queries. The rest of the diagnostic can wait. The disk cannot.

User's avatar

Continue reading this post for free, courtesy of Haider Z @ Microsoft.

Or purchase a paid subscription.
© 2026 Haider Z · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture