The Sev-1 Database

The Sev-1 Database

[PREMIUM] PostgreSQL Extended Statistics Diagnostic with 8 Queries

Last week I wrote about the VLDB paper that proved PostgreSQL’s independence assumption drives most bad plans. Here are the 8 queries I use to find every correlated column pair.

Haider Z @ Microsoft's avatar
Haider Z @ Microsoft
Apr 28, 2026
∙ Paid

CREATE STATISTICS Audit Nobody Runs


Table of Contents

  1. Why This Audit Matters

  2. First Wrong Response: CREATE STATISTICS On Every Table

  3. Second Wrong Response: Trust Autoanalyze to Catch It

  4. How PostgreSQL Uses Extended Statistics Internally

  5. Query 1 — Existing Extended Statistics Inventory

  6. Query 2 — Top Slow Queries with Multi-Column WHERE Clauses

  7. Query 3 — Detect Functional Dependencies Automatically

  8. Query 4 — Find Estimate Divergence in pg_stat_statements

  9. Query 5 — Column Correlation Via Cross-Tabulation

  10. Query 6 — Extended Stats Coverage on Partitioned Tables

  11. Query 7 — Statistics Target Audit

  12. Query 8 — Before/After Plan Validation

  13. Diagnostic Sequence to Follow

  14. Key Takeaway


📨 If this saved you time, share it with one engineer who manages production databases.

Share The Sev-1 Database

The Sev-1 Database is a reader-supported publication. To receive new posts and support my work, consider becoming a paid subscriber 8$/month.


1. Why This Audit Matters

If you read last week’s post, you already know the finding: PostgreSQL multiplies selectivities as if every column is independent. On real data, columns are correlated. Planner underestimates. Nested loops get chosen where hash joins should win. Queries that ran in seconds start taking hours.

Fix CREATE STATISTICS has been in PostgreSQL since version 10. Most production databases have zero extended statistics objects. This audit finds where you need them and where you don’t.


2. First Wrong Response: CREATE STATISTICS On Every Table

Don’t. Extended statistics have a real cost. ANALYZE time grows. Planner work per query grows. Defining mcv across five columns on a 500M-row table can add seconds to every ANALYZE run and measurable overhead to every query plan.

Target only where the planner is getting it wrong. Query 4 tells you where.


3. Second Wrong Response: Trust Autoanalyze to Catch It

Autoanalyze updates single-column statistics. It does not create extended statistics objects. An object only exists after you run CREATE STATISTICS manually. Until then, autoanalyze can refresh its values but it cannot invent the object itself.


4. How PostgreSQL Uses Extended Statistics Internally

From the PostgreSQL 18 manual, Chapter 14.2.2:

Extended statistics objects track data about the specified table, foreign table, or materialized view. The object does not generate statistics until ANALYZE is run on the table.

Three kinds of extended statistics:

  • dependencies — functional dependency between columns. Cheapest to compute, smallest storage.

  • ndistinct — number of distinct combinations. Critical for GROUP BY on multiple columns.

  • mcv — most common value combinations. Largest storage, highest accuracy on skewed data.

Planner reads these from pg_statistic_ext_data at plan time.


5. Query 1 — Existing Extended Statistics Inventory

Start here. You need to know what already exists before creating more.

SELECT
  n.nspname AS schema_name,
  c.relname AS table_name,
  s.stxname AS stats_name,
  pg_catalog.array_to_string(
    ARRAY(
      SELECT a.attname
      FROM unnest(s.stxkeys) AS k
      JOIN pg_attribute a ON a.attrelid = s.stxrelid AND a.attnum = k
    ),
    ', '
  ) AS columns,
  s.stxkind AS kinds,
  CASE
    WHEN 'd' = ANY(s.stxkind) THEN 'dependencies '
    ELSE ''
  END ||
  CASE
    WHEN 'f' = ANY(s.stxkind) THEN 'ndistinct '
    ELSE ''
  END ||
  CASE
    WHEN 'm' = ANY(s.stxkind) THEN 'mcv'
    ELSE ''
  END AS stat_types
FROM pg_statistic_ext s
JOIN pg_class c ON c.oid = s.stxrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
ORDER BY n.nspname, c.relname;

If this returns zero rows, your database is running with the exact limitation the VLDB paper measured in 2015.

6. Query 2 — Top Slow Queries with Multi-Column WHERE Clauses

The queries most likely to benefit from extended statistics are the ones with two or more columns in the WHERE clause on the same table.

This post is for paid subscribers

Already a paid subscriber? Sign in
© 2026 Haider Z · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture