[FREE] 1979 Paper Still Running Your PostgreSQL Optimizer
Here's what Patricia Selinger designed, what survived 46 years, and what you need for PostgreSQL.
Every PostgreSQL query plan you’ve ever read is built on a paper Patricia Selinger wrote at IBM in 1979. It’s 46 years old. It still runs your database. Here’s what she designed, what survived, and what PostgreSQL had to fix.
Table of Contents
What We’re Working With
Paper
Four Ideas That Became Every SQL Optimizer
What PostgreSQL Kept From 1979
What PostgreSQL Had to Change
Where Selinger’s Assumptions Still Bite You in Production
5 Things You Can Do About It
Key Takeaway
1. What We’re Working With
Every time you run EXPLAIN ANALYZE and see a join order, a hash join chosen over a nested loop, or a cost estimate in arbitrary units you are looking at output from an algorithm designed by Patricia Selinger at IBM Research in 1979.
Not a distant ancestor of the algorithm. Algorithm. Paper’s pseudocode is close enough to PostgreSQL’s planner source code that reading one helps you read the other.
46 years later, SQL Server, Oracle, DB2, MySQL, and PostgreSQL all run variations of same design. Understanding what Selinger got right and what she didn’t tells you why your database behaves the way it does.
2. Paper
“Access Path Selection in a Relational Database Management System”
Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price.
ACM SIGMOD International Conference on Management of Data, 1979.
Paper describes the optimizer for System R IBM’s research prototype that became DB2. It solves a problem that didn’t exist before 1979: given a SQL query with multiple joins, how does the database decide which order to join the tables, which access method to use, and which algorithm to apply at each step?
Before Selinger, databases picked join orders heuristically the order the programmer wrote them. Selinger’s insight was simple and has never been improved on: enumerate all plans, cost each one, pick the cheapest.
📨 If this was useful, share it with one engineer on your team who manages production databases.
3. Four Ideas That Became Every SQL Optimizer
Idea 1: Cost as a Currency
Selinger introduced concept of cost units an abstract currency combining I/O pages read and CPU work, blended into one number so the optimizer can compare any two plans.
In PostgreSQL today, cost 1.0 is the price of fetching one 8KB page sequentially. Every parameter seq_page_cost, random_page_cost, cpu_tuple_cost is a descendant of Selinger’s original units.
Idea 2: Dynamic Programming for Join Order
Rather than trying every permutation of N tables (which is N factorial untenable beyond 8 tables), Selinger applied dynamic programming. Build up optimal plans for 2-table joins, then 3-table joins, reusing the results.
PostgreSQL uses this exact approach. Source file PostgreSQL: joinrels.c is named after the concept. Query plans you see in EXPLAIN are the winners of a dynamic-programming search.
Idea 3: Interesting Orders
An ordered intermediate result is worth more than an unordered one it can feed a merge join, avoid a sort, or support ORDER BY for free. Selinger called these “interesting orders” and proved the optimizer must track them separately.
PostgreSQL’s pathkeys are the direct descendant. When you see a plan that skips a sort because an index returned data in the right order.
Idea 4: Selectivity Estimation via Statistics
Selinger introduced the idea that the optimizer must estimate how many rows pass each filter. To do this, database maintains statistics about column distributions.
Her 1979 formulas used simple assumptions: uniform distribution, column independence, constant join selectivity factors. Every one of those assumptions is wrong on real data.
Which brings us to the second half of this post.
4. What PostgreSQL Kept From 1979
Read src/backend/optimizer/README in the PostgreSQL source. It reads like a modernized version of Selinger’s paper.
Preserved almost unchanged:
Dynamic programming for join enumeration (up to
geqo_threshold, default 12 tables)Cost units as an abstract currency
Interesting-order tracking via pathkeys
Per-node cost estimation composed from child node costs
Overall pattern: enumerate plans, cost them, pick the cheapest
5. What PostgreSQL Had to Change
The 1979 paper assumed:
Columns are independent (wrong)
Data is uniformly distributed (wrong)
Join selectivity can be estimated from single-table stats (wrong past 2 joins)
The database is small enough to cost every plan (wrong past ~12 tables)
Every major PostgreSQL statistics feature is a patch on one of these assumptions:
Histograms (for non-uniform distributions)
Most-common-value lists (for skewed data)
GEQO genetic query optimization (for queries with too many tables for dynamic programming)
Extended statistics via
CREATE STATISTICS(for correlated columns)Planner hooks and custom scan nodes (for when the default costing fails)
The VLDB 2015 paper I wrote about last week measured what happens when these patches aren’t enough. Answer: plans that are 1,000x to 100,000,000x off on large join queries.
6. Where Selinger’s Assumptions Still Bite You in Production
Three patterns I see repeatedly:
Pattern 1: Correlated predicates at the leaf
WHERE region = 'US' AND currency = 'USD' the planner multiplies selectivities as if they’re independent. They aren’t. CREATE STATISTICS is the direct fix, and the PostgreSQL 18 manual says so in Chapter 14.2.2.
Pattern 2: Join cardinality explosion
Selinger’s join selectivity formula was a constant. PostgreSQL refined it but the error compounds with join depth. By the fifth join the estimate is meaningless. This is why big analytical queries against real schemas go wildly wrong.
Pattern 3: GEQO kicks in at 12 joins and plan becomes non-deterministic
When your query has more than geqo_threshold tables, PostgreSQL switches from Selinger’s dynamic programming to a genetic algorithm. Same query can produce different plans between executions. On Sev-1 calls this is what “the query was fine yesterday” usually means.
7. 5 Things You Can Do About It
Fix 1 — Read your EXPLAIN plans with Selinger’s framework in mind
Every node has a cost. Every cost came from a row estimate. Every row estimate came from statistics. When a plan is bad, walk backward through that chain.
Fix 2 — Raise geqo_threshold if your queries routinely touch 12+ tables
Set geqo_threshold to 14 or 16 if you have CPU budget. Deterministic dynamic-programming plans beat genetic-algorithm plans almost every time.
Fix 3 — CREATE STATISTICS on the column pairs Selinger’s math can’t handle
This is direct fix for Pattern 1. Eight years old, still the biggest planner win you can give yourself in one DDL statement.
Fix 4 — Use the EXPLAIN ANALYZE, BUFFERS output to validate cost model inputs
If shared_blks_read is high and your random_page_cost is 4.0, the planner thinks random reads are slow when they aren’t. Fix random_page_cost on SSDs.
Fix 5 — Stop fighting the planner. Fix its inputs.
Selinger’s algorithm is correct given accurate statistics. The VLDB paper proved this empirically in 2015. Hints, query rewrites, and enable_seqscan = OFF are not the answer. Fresh ANALYZE, correct statistics target, extended statistics, and correct cost parameters are.
8. Key Takeaway
Your PostgreSQL optimizer is a 46-year-old algorithm with 30 years of patches on top.
Algorithm is almost always right. Patches cover the places where Selinger’s 1979 assumptions fail on real data.
If you know which assumption is failing, you know which patch to apply.
🚀 PostgreSQL Health Report — 60+ diagnostic checks in one SQL file. Statistics freshness, extended stats coverage, cost parameter audit, bloat, vacuum health.
📌 Upgrade to Premium ($8/month) — Premium subscribers get the full SQL diagnostic packs, step-by-step playbooks, and the queries I actually run during a Sev-1.
References
Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., Price, T. G. (1979). Access Path Selection in a Relational Database Management System. ACM SIGMOD. — https://dl.acm.org/doi/10.1145/582095.582099
PostgreSQL 18 Source — src/backend/optimizer/README — GitHub
PostgreSQL 18 Documentation — Genetic Query Optimizer (GEQO) — https://www.postgresql.org/docs/18/geqo.html


