The Sev-1 Database

The Sev-1 Database

[RCA] DELETE That Took Down Production for 3 Hours

A routine cleanup job. A 30-second maintenance window. And a database that refused to commit.

Haider Z @ Microsoft's avatar
Haider Z @ Microsoft
Mar 10, 2026
∙ Paid

Severity 1 raised.

Database is unresponsive. All services degraded.

I pulled up the monitoring dashboard. CPU pinned at 100%. Connection pool saturated. Every application server showing timeouts. The usual chaos of a system that’s about to go fully down.

But nothing had changed. No deployments. No traffic spike. No schema migration. The on-call engineer had already checked all of those.

There was one thing running: a cleanup job. A simple DELETE. It was supposed to remove old inactive accounts a batch of about 2,000 rows from a table with half a million records. The kind of operation that should finish in under a second.

It had been running for 47 minutes. The transaction could not commit, more like high commit in hang states or taking longer time. Every second it stayed open, more sessions piled up behind it waiting for locks.


Table of Contents

  1. First Wrong Diagnosis

  2. Second Wrong Diagnosis

  3. Third Wrong Diagnosis

  4. What I Found

  5. Part That Surprises People

  6. How Common Is This?

  7. Root Cause

  8. What PostgreSQL Was Actually Doing

  9. Why ON DELETE CASCADE Makes It Worse

  10. Why Deferred Constraints Shift the Symptom

  11. 2 Diagnostic Query

  12. Prevention Checklist

  13. Key Takeaway

Share The Sev-1 Database


First Wrong Diagnosis

The team’s first instinct was locks. Made sense. A long-running DELETE holding locks, blocking everything behind it. Classic PostgreSQL cascade failure.

They checked pg_locks. Yes, the session held locks. But they were normal row-level locks from the DELETE itself.

No deadlocks. No lock contention from other sessions at least not initially.

The long-running transaction was causing other sessions to pile up and wait, but that was a symptom, not the cause.

The DELETE itself was the problem. It simply would not finish.


Second Wrong Diagnosis

Next guess: bad query plan. Maybe the DELETE was doing a sequential scan on the parent table. Maybe statistics were stale. Maybe the planner had picked a catastrophic execution strategy.

They ran EXPLAIN on the DELETE. The plan looked perfectly fine. It was using the correct index on the WHERE clause. Estimated rows matched reality. The plan was exactly what you’d want to see for a targeted DELETE of 2,000 rows.

And yet it had been running for over an hour.


Third Wrong Diagnosis

Someone suggested disk I/O. Maybe the storage was saturated. They checked iostat. Disk utilization was high but weirdly, it was almost entirely reads. Not writes. A DELETE of 2,000 rows shouldn’t be doing massive reads.

They checked pg_stat_activity. The session was in active state. Wait event: DataFileRead. The database was reading. A lot. But reading what?

That’s when they called me.


What I Found

I asked one question. Looked at one thing. And within five minutes I knew exactly what was happening and why.

The cause was something that had been sitting in their schema for eight months. Silent. Invisible. Completely harmless until this cleanup job ran for the first time.

It wasn’t a lock problem.
It wasn’t a query plan problem.
It wasn’t a disk problem.
It wasn’t a PostgreSQL bug.

It was something far simpler. And PostgreSQL will never warn you about it.


Part That Surprises People

Here’s what makes this dangerous: every standard diagnostic pointed somewhere else.

  • EXPLAIN showed a clean plan

  • pg_locks showed normal locks

  • pg_stat_activity showed an active session, no obvious anomaly

  • There were no errors in the PostgreSQL logs

  • No alerts from the monitoring system until everything was already on fire

The root cause doesn’t show up in any of the places engineers instinctively check. I’ve seen DBAs spend hours chasing locks and query plans on this exact issue.

I’ve seen teams reboot the database, kill the session, restart the batch job and watch it hang again in exactly the same way.

Because they were looking in the wrong place entirely.


How Common Is This?

Very.

I’ve encountered this exact pattern at least a dozen times across different companies, different schema designs, different PostgreSQL versions. The setup is always slightly different but the failure mode is identical:

  • A DELETE on a parent table

  • That should be fast

  • That runs for minutes or hours instead

  • With no obvious explanation in the standard diagnostics

It happens because of a specific gap between what PostgreSQL enforces and what PostgreSQL creates automatically. Engineers coming from other database engines are especially vulnerable because their previous engine handled this for them.

PostgreSQL does not.


Root Cause

When I looked at their schema, I asked one question:


🚀 Get the PostgreSQL Health Report: a single SQL function that runs 60+ diagnostic checks including TXID wraparound detection, autovacuum health, much more and ready-to-run fixes.

Get it here → 29$


The Sev-1 Database is a reader-supported publication. To receive new posts and support my work, consider becoming a paid member 8$/month.

This post is for paid subscribers

Already a paid subscriber? Sign in
© 2026 Haider Z · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture