[PREMIUM] PostgreSQL Extended Statistics Diagnostic with 8 Queries
Last week I wrote about the VLDB paper that proved PostgreSQL’s independence assumption drives most bad plans. Here are the 8 queries I use to find every correlated column pair.
CREATE STATISTICS Audit Nobody Runs
Table of Contents
Why This Audit Matters
First Wrong Response: CREATE STATISTICS On Every Table
Second Wrong Response: Trust Autoanalyze to Catch It
How PostgreSQL Uses Extended Statistics Internally
Query 1 — Existing Extended Statistics Inventory
Query 2 — Top Slow Queries with Multi-Column WHERE Clauses
Query 3 — Detect Functional Dependencies Automatically
Query 4 — Find Estimate Divergence in pg_stat_statements
Query 5 — Column Correlation Via Cross-Tabulation
Query 6 — Extended Stats Coverage on Partitioned Tables
Query 7 — Statistics Target Audit
Query 8 — Before/After Plan Validation
Diagnostic Sequence to Follow
Key Takeaway
📨 If this saved you time, share it with one engineer who manages production databases.
1. Why This Audit Matters
If you read last week’s post, you already know the finding: PostgreSQL multiplies selectivities as if every column is independent. On real data, columns are correlated. Planner underestimates. Nested loops get chosen where hash joins should win. Queries that ran in seconds start taking hours.
Fix CREATE STATISTICS has been in PostgreSQL since version 10. Most production databases have zero extended statistics objects. This audit finds where you need them and where you don’t.
2. First Wrong Response: CREATE STATISTICS On Every Table
Don’t. Extended statistics have a real cost. ANALYZE time grows. Planner work per query grows. Defining mcv across five columns on a 500M-row table can add seconds to every ANALYZE run and measurable overhead to every query plan.
Target only where the planner is getting it wrong. Query 4 tells you where.
3. Second Wrong Response: Trust Autoanalyze to Catch It
Autoanalyze updates single-column statistics. It does not create extended statistics objects. An object only exists after you run CREATE STATISTICS manually. Until then, autoanalyze can refresh its values but it cannot invent the object itself.
4. How PostgreSQL Uses Extended Statistics Internally
From the PostgreSQL 18 manual, Chapter 14.2.2:
Extended statistics objects track data about the specified table, foreign table, or materialized view. The object does not generate statistics until ANALYZE is run on the table.
Three kinds of extended statistics:
dependencies — functional dependency between columns. Cheapest to compute, smallest storage.
ndistinct — number of distinct combinations. Critical for GROUP BY on multiple columns.
mcv — most common value combinations. Largest storage, highest accuracy on skewed data.
Planner reads these from pg_statistic_ext_data at plan time.
5. Query 1 — Existing Extended Statistics Inventory
Start here. You need to know what already exists before creating more.
SELECT
n.nspname AS schema_name,
c.relname AS table_name,
s.stxname AS stats_name,
pg_catalog.array_to_string(
ARRAY(
SELECT a.attname
FROM unnest(s.stxkeys) AS k
JOIN pg_attribute a ON a.attrelid = s.stxrelid AND a.attnum = k
),
', '
) AS columns,
s.stxkind AS kinds,
CASE
WHEN 'd' = ANY(s.stxkind) THEN 'dependencies '
ELSE ''
END ||
CASE
WHEN 'f' = ANY(s.stxkind) THEN 'ndistinct '
ELSE ''
END ||
CASE
WHEN 'm' = ANY(s.stxkind) THEN 'mcv'
ELSE ''
END AS stat_types
FROM pg_statistic_ext s
JOIN pg_class c ON c.oid = s.stxrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
ORDER BY n.nspname, c.relname;If this returns zero rows, your database is running with the exact limitation the VLDB paper measured in 2015.
6. Query 2 — Top Slow Queries with Multi-Column WHERE Clauses
The queries most likely to benefit from extended statistics are the ones with two or more columns in the WHERE clause on the same table.


