[DBEng] How Stripe Handles 500M+ Database Writes Per Day
Stripe didn’t use MySQL or PostgreSQL for their core storage layer. They built their own: a MongoDB-compatible distributed database called DocDB.
Here’s the architecture behind it and what every high-write database team can take from it.
Table of Contents
Why Financial Writes Are a Different Problem
What DocDB Is, in Stripe’s Own Words
The Core Architecture: Shards, Proxies, and Chunk Metadata
The Data Movement Platform: Zero-Downtime Migrations at Scale
Idempotency Keys: Exactly-Once Semantics at the API Layer
What the Idempotency Pattern Looks Like at the Database Layer
What Database Engineers Can Take From This
The Sev-1 Database is a reader-supported publication. Each week I share practical takeaways from real incidents: no theory, just what actually breaks and how to fix it.
In 2023, Stripe processed $1 trillion in total payments volume.
Their database infrastructure, a system they built themselves, maintained 99.999% uptime throughout. This post covers what that infrastructure actually looks like, drawn directly from their engineering blog and API documentation.
Why Financial Writes Are a Different Problem
Most high-scale systems accept some degree of eventual consistency. Stripe cannot. A payment either happened or it didn’t. Duplicate charges are a catastrophic failure mode, not an edge case.
Two hard requirements shaped every architectural decision:
Every write must be durable. A payment that appears to succeed but isn’t committed is a financial integrity failure.
Every write must be idempotent. At scale, network retries are inevitable. The same request will arrive more than once. The system must handle this without creating duplicates.
Standard databases handle the first constraint. The second required a deliberate design decision at both the API and storage layers.
What DocDB Is, in Stripe’s Own Words
From Stripe’s engineering blog (June 2024):
Stripe’s DocDB is an extension of MongoDB Community, a popular open-source database, and consists of a set of services that we built in-house. It serves over five million queries per second from Stripe’s product applications.
And on why they built it rather than using an existing service:
We chose to build DocDB on top of MongoDB Community because of the flexibility of its document model and its ability to handle massive volumes of real-time data at scale. MongoDB Atlas didn’t exist in 2011, so we built a self-managed cluster of MongoDB instances running in the cloud.
The key numbers from the same post:
5 million queries per second
10,000+ distinct query shapes
Petabytes of financial data
5,000+ collections across 2,000+ database shards
This is not MySQL. It is not PostgreSQL. It is not stock MongoDB either. It is a custom database-as-a-service built on MongoDB Community, with Stripe’s own proxy layer, replication system, and migration infrastructure layered on top.
The Core Architecture: Shards, Proxies, and Chunk Metadata
Stripe describes their architecture this way in the blog post:
Thousands of database shards, each housing a small chunk of the cumulative data, now underlie all of Stripe’s products. When an application sends a query to a database proxy server, it parses the query, routes it to one or more shards, combines the results from the shards, and returns them back to the application.
The routing layer works through a chunk metadata service:
They rely on a chunk metadata service that maps chunks to database shards, making it easy to look up the relevant shards for a given query.
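To make that lookup concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `ChunkMap` class, the key ranges, and the shard names are hypothetical, and Stripe's post does not publish the metadata service's data model (their real proxies are written in Go).

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start_key: str  # inclusive lower bound of the chunk's key range
    shard: str      # shard currently serving this chunk (hypothetical name)


class ChunkMap:
    """Toy chunk metadata: sorted, contiguous key ranges mapped to shards."""

    def __init__(self, chunks):
        self._chunks = sorted(chunks, key=lambda c: c.start_key)
        self._starts = [c.start_key for c in self._chunks]

    def shard_for(self, key: str) -> str:
        # Find the last chunk whose start_key <= key.
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            raise KeyError(f"no chunk covers key {key!r}")
        return self._chunks[i].shard

    def shards_for_range(self, lo: str, hi: str) -> list:
        # A range query may touch several chunks, hence several shards.
        first = max(bisect.bisect_right(self._starts, lo) - 1, 0)
        last = bisect.bisect_right(self._starts, hi) - 1
        return sorted({c.shard for c in self._chunks[first:last + 1]})


chunk_map = ChunkMap([
    Chunk("", "shard-a"),
    Chunk("g", "shard-b"),
    Chunk("p", "shard-a"),
])
```

A point query resolves to one shard. A range query resolves to a set of shards, which is why a proxy in this design has to fan out and combine results.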
Each physical database shard is deployed as a replica set: a primary node plus multiple secondary nodes, with replication and automated failover. Product applications at Stripe never connect directly to database shards. They go through a fleet of database proxy servers that Stripe built internally in Go.
The proxy layer handles reliability, scalability, admission control, and access control. It also enforces query shape restrictions, exposing what Stripe calls a minimal set of database functions to avert self-inflicted issues due to suboptimal queries from client applications.
That last point is worth pausing on. Stripe’s proxy actively rejects queries that violate their access patterns. They do not give application teams open access to the database. This is an explicit architectural constraint, not just a convention.
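Stripe does not describe how its proxy defines or checks a query shape, so treat the following as a sketch of the general idea only. It assumes a shape is the collection name plus the set of filtered fields, and the `ALLOWED_SHAPES` entries are made up for the example.

```python
# Hypothetical allowlist: (collection, sorted tuple of filtered fields).
ALLOWED_SHAPES = {
    ("payments", ("customer_id",)),
    ("payments", ("customer_id", "status")),
}


class QueryRejected(Exception):
    """Raised when a query's shape is not on the allowlist."""


def query_shape(collection: str, filter_doc: dict) -> tuple:
    # A shape ignores the values and keeps only which fields are filtered on.
    return (collection, tuple(sorted(filter_doc)))


def admit(collection: str, filter_doc: dict) -> tuple:
    # Reject at the proxy, before the query ever reaches a shard.
    shape = query_shape(collection, filter_doc)
    if shape not in ALLOWED_SHAPES:
        raise QueryRejected(f"query shape not allowed: {shape}")
    return shape
```

The design choice that matters here is where the check runs: an unindexed or unreviewed query fails fast in the proxy instead of degrading a shard.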
The Data Movement Platform: Zero-Downtime Migrations at Scale
The most technically detailed part of Stripe’s blog post covers their Data Movement Platform, the system that allows them to split, merge, and migrate database shards with zero downtime.
From the post:
The Data Movement Platform enabled our transition from running a small number of database shards (each with tens of terabytes of data) to thousands of database shards (each with a fraction of the original data). It also provides client-transparent migrations with zero downtime, which makes it possible to build a highly elastic DBaaS offering.
In 2023 alone, they used it to bin-pack underutilized databases:
We bin-packed thousands of underutilized databases by migrating 1.5 petabytes of data transparent to product applications, and reduced the total number of underlying DocDB shards by approximately three quarters.
The traffic switch at the core of a shard migration takes under two seconds:
The entire traffic switch protocol takes less than two seconds to execute, and all failed reads and writes directed to the source shard succeed on retries.
The migration sequence has six steps: chunk registration, bulk data import, async replication from source to target, correctness check via point-in-time snapshots, traffic switch, and deregistration.
The correctness check compares snapshots rather than doing a live row-by-row comparison, a deliberate choice to avoid impacting shard throughput during the verification step.
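Stripe's post does not say how the two snapshots are compared. One common way to do this kind of offline check is to compute an order-independent digest of each snapshot and compare the digests. The sketch below assumes documents are JSON-serializable dicts, and the function names are hypothetical.

```python
import hashlib
import json


def snapshot_digest(docs) -> str:
    """Order-independent digest of a snapshot (an iterable of JSON-like dicts)."""
    per_doc = sorted(
        # sort_keys makes the hash independent of field order within a document
        hashlib.sha256(json.dumps(d, sort_keys=True).encode()).hexdigest()
        for d in docs
    )
    # Sorting the per-document hashes makes the result independent of scan order.
    return hashlib.sha256("".join(per_doc).encode()).hexdigest()


def snapshots_match(source_docs, target_docs) -> bool:
    return snapshot_digest(source_docs) == snapshot_digest(target_docs)
```

Because both inputs are snapshots, the comparison can run away from the serving path, which is the property the post highlights.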
One specific optimization they mention that improved bulk load performance significantly:
By sorting the data based on the most common index attributes in the collections and inserting it in sorted order, we significantly enhanced the proximity of writes, boosting write throughput by 10x.
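The intuition, assuming a B-tree-style index, is that consecutive inserts with neighboring keys tend to touch the same pages. A minimal sketch of the "sort, then insert in batches" step follows. The `sorted_batches` helper and its parameters are hypothetical, and sorting in memory is a simplification that only works for data that fits.

```python
from itertools import islice
from operator import itemgetter


def sorted_batches(docs, index_fields, batch_size):
    """Yield insert batches ordered by the given index fields."""
    ordered = iter(sorted(docs, key=itemgetter(*index_fields)))
    # Pull fixed-size batches off the sorted stream until it is exhausted.
    while batch := list(islice(ordered, batch_size)):
        yield batch
```

Each batch would then go to the target shard's bulk insert call in order.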
Idempotency Keys: Exactly-Once Semantics at the API Layer
The second design pattern this post covers comes from a separate Stripe engineering post on idempotency (2017) and their official API documentation.
Stripe’s framing of the problem:
To overcome this sort of inherently unreliable environment, it’s important to design APIs and clients that will be robust in the event of failure, and will predictably bring a complex integration to a consistent state despite them.
Their definition of how idempotency keys work at the API layer, directly from their docs:
Stripe’s idempotency works by saving the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeds or fails. Subsequent requests with the same key return the same result, including 500 errors.
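The behavior described in that quote can be sketched in a few lines. This is not Stripe's implementation: the `IdempotencyStore` class is hypothetical, it keeps responses in memory rather than in a durable store, and it ignores concurrent requests that arrive with the same key.

```python
class IdempotencyStore:
    """In-memory stand-in for a durable store of first responses."""

    def __init__(self):
        self._responses = {}  # idempotency key -> (status_code, body)

    def execute(self, key, handler):
        # Replay the saved response if this key has been seen before.
        if key in self._responses:
            return self._responses[key]
        try:
            response = handler()
        except Exception:
            response = (500, {"error": "internal error"})
        # Save the first result whether it succeeded or failed.
        self._responses[key] = response
        return response
```

Note the consequence of saving failures too: a retry with the same key replays the saved 500 and does not run the handler again.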
And on key generation:
A client generates an idempotency key, which is a unique key that the server uses to recognize subsequent retries of the same request. How you create unique keys is up to you, but we suggest using V4 UUIDs, or another random string with enough entropy to avoid collisions.
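On the client side this takes very little code. The sketch below uses Python's standard `uuid` module. The helper name is hypothetical, and `Idempotency-Key` is the request header Stripe's API uses for this value.

```python
import uuid


def new_idempotency_key() -> str:
    """Return a random V4 UUID string to use as an idempotency key."""
    return str(uuid.uuid4())


# Generate the key once per logical operation, then reuse it on every retry.
key = new_idempotency_key()
headers = {"Idempotency-Key": key}
```

The part that is easy to get wrong is scope: generating a fresh key inside the retry loop defeats the purpose, because each retry then looks like a new request.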
Key expiration, from their docs:
You can remove keys from the system automatically after they’re at least 24 hours old. We generate a new request if a key is reused after the original is pruned.
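If you store idempotency keys yourself, the same retention rule is easy to apply. A minimal sketch, assuming keys are tracked with the Unix timestamp at which they were first seen. The function name and the dict-based store are hypothetical.

```python
RETENTION_SECONDS = 24 * 60 * 60  # keep keys for at least 24 hours


def prune_expired(first_seen: dict, now: float) -> list:
    """Remove keys first seen at least RETENTION_SECONDS ago; return them."""
    expired = [k for k, seen_at in first_seen.items()
               if now - seen_at >= RETENTION_SECONDS]
    for k in expired:
        del first_seen[k]
    return expired
```

After pruning, a request that reuses an old key is treated as new, so the retention window should be longer than the longest retry window your clients use.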
Their three-principle summary from the idempotency blog post:
Make sure that failures are handled consistently. Have clients retry operations against remote services.
Make sure that failures are handled safely. Use idempotency and idempotency keys to allow clients to pass a unique value and retry requests as needed.
Make sure that failures are handled responsibly. Use techniques like exponential backoff and random jitter.
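For the third principle, here is a minimal client-side sketch of exponential backoff with random jitter. The function names, the 0.5 second base, the 30 second cap, and the choice to retry only on `ConnectionError` are all assumptions for the example, not values from Stripe.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one jittered delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for n in range(retries):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(send, attempts=5, sleep=time.sleep):
    """Call send() up to `attempts` times, backing off after each ConnectionError."""
    delays = backoff_delays(attempts - 1)
    while True:
        try:
            return send()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error
            sleep(delay)
```

This is only safe when `send` is idempotent, which is why the second principle comes before the third.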
What the Idempotency Pattern Looks Like at the Database Layer
Stripe’s blog and docs describe the pattern at the API level. They do not describe the storage implementation. In PostgreSQL or MySQL, the usual way to make this work is a unique constraint on the idempotency key column. This is standard SQL, not Stripe-specific, but it’s the mechanism that enforces exactly-once semantics at the storage layer.
-- PostgreSQL: table design that supports idempotency
CREATE TABLE payments (
    id              BIGSERIAL PRIMARY KEY,
    idempotency_key VARCHAR(255) NOT NULL,
    customer_id     BIGINT NOT NULL,
    amount_cents    BIGINT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'pending',
    created_at      TIMESTAMPTZ DEFAULT now()
);

-- The constraint that enforces exactly-once semantics
CREATE UNIQUE INDEX idx_payments_idempotency_key
    ON payments (idempotency_key);

-- Application insert: a duplicate key is skipped rather than raising an error
INSERT INTO payments (idempotency_key, customer_id, amount_cents, status)
VALUES ($1, $2, $3, 'pending')
ON CONFLICT (idempotency_key) DO NOTHING
RETURNING *;

-- If the INSERT returns no row:
--   the record already exists, so fetch it by idempotency_key and return it
--   no duplicate charge was created

-- MySQL equivalent (assumes the same unique index on idempotency_key)
INSERT INTO payments (idempotency_key, customer_id, amount_cents, status)
VALUES (?, ?, ?, 'pending')
ON DUPLICATE KEY UPDATE idempotency_key = idempotency_key;
-- affected rows = 0 (with default client flags) means duplicate:
-- the application fetches the existing record
This pattern converts at-least-once delivery at the network layer into exactly-once semantics at the database layer, without distributed transactions and without application-level deduplication that fails under retry pressure.
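To see the insert-or-fetch flow end to end, here is a runnable sketch using Python's built-in `sqlite3` module as a stand-in for PostgreSQL. It assumes SQLite 3.24 or newer for the `ON CONFLICT ... DO NOTHING` clause, and the `create_payment` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY,
        idempotency_key TEXT NOT NULL UNIQUE,
        customer_id INTEGER NOT NULL,
        amount_cents INTEGER NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'
    )
""")


def create_payment(conn, key, customer_id, amount_cents):
    """Insert once per key; return (row, created)."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO payments (idempotency_key, customer_id, amount_cents) "
            "VALUES (?, ?, ?) ON CONFLICT (idempotency_key) DO NOTHING",
            (key, customer_id, amount_cents),
        )
        created = cur.rowcount == 1  # 0 rows changed means the key already existed
        row = conn.execute(
            "SELECT id, customer_id, amount_cents, status FROM payments "
            "WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
    return row, created
```

Calling `create_payment` twice with the same key returns the same row both times, with `created` set to `False` on the second call. The caller can use that flag to decide whether to trigger side effects such as contacting a card network.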
What Database Engineers Can Take From This
Building a custom database is not the takeaway. That decision was made in 2011 when MongoDB Atlas didn’t exist and Stripe had specific requirements no off-the-shelf tool satisfied. It required years of infrastructure investment. It is not a pattern to copy.
The proxy layer pattern is underused. Stripe routes all application traffic through a proxy that enforces query shape, admission control, and access control. Most teams allow applications to connect directly to the database.
Result: suboptimal queries, missing indexes, and connection pool exhaustion incidents that a proxy layer would have caught earlier.
Design for shard migration before you need it. Stripe’s Data Movement Platform exists because they needed to split shards under load without downtime. If your data has a natural tenant dimension and you haven’t modeled it that way from the start, the eventual migration will be significantly more expensive than building for it early.
Idempotency keys at the database layer are underused. Any table that receives writes over a network is vulnerable to duplicate writes on retry. A unique constraint on an operation identifier is a one-time addition. It prevents an entire class of data integrity incidents. The Stripe docs describe this as a fundamental reliability primitive, not an advanced feature.
Key Takeaway
Stripe’s core storage layer is DocDB, built on MongoDB Community starting in 2011 because MongoDB Atlas didn’t exist yet and no off-the-shelf DBaaS met their requirements.
It now serves 5M queries/second across 2,000+ shards. The two patterns with direct applicability to every database team: a proxy layer that enforces query constraints before they reach the database, and idempotency keys at the database layer on any table that receives writes over a network. Both are design decisions, not scale decisions.
Sources — everything in this post is drawn directly from:
→ stripe.com/blog — How Stripe’s document databases supported 99.999% uptime with zero-downtime data migrations (June 2024)
→ stripe.com/blog — Designing robust and predictable APIs with idempotency (2017)
→ docs.stripe.com — Idempotent requests (official API docs)


