How OpenAI Scaled a Single PostgreSQL Instance to 800 Million ChatGPT Users: 50 Read Replicas, PgBouncer, Cascading Replication

Table of Contents

When ChatGPT hit 800 million users, how do you design your database architecture? OpenAI’s answer surprised many: no sharding, no switch to NoSQL, just PostgreSQL — but with a few precise engineering decisions that made it hold.

TL;DR

OpenAI’s ChatGPT database is a single-primary PostgreSQL instance running on Azure, paired with ~50 read replicas to distribute query load. Connection pooling uses PgBouncer (one independent Kubernetes deployment per replica), compressing connection time from 50ms to 5ms. Write load is managed by migrating high-write subsystems to other distributed storage. Cascading replication is being tested, targeting the ability to scale replica count to 100+ without increasing replication pressure on the primary.

Background

In February 2026, OpenAI published a technical blog post revealing ChatGPT’s PostgreSQL scaling architecture. It’s one of the most honest database architecture write-ups in recent years — not bragging about how clever they are, but straightforwardly explaining the problems they encountered, their solutions, and things they deliberately chose not to do (like sharding).

Design Philosophy

Why No Sharding?

Sharding — horizontally splitting data across multiple database nodes — is typically considered a necessary step for ultra-large-scale systems. OpenAI evaluated it and chose not to, for a clear reason:

ChatGPT is a read-heavy, write-light system. Users are primarily reading conversation history and loading interface settings; writes are relatively concentrated in a few specific paths. This workload pattern can be effectively distributed with read replicas, without sharding’s complexity.

The engineering cost of sharding is extremely high. Correctly implementing sharding requires modifying hundreds of application endpoints, handling cross-shard transaction consistency, and cross-shard query routing. OpenAI estimated this would take months to years of engineering time, while the read replica approach’s latency problems had much lighter-weight solutions.

Read Replica Strategy

                    ┌─ Read Replica 1 (us-east)
Primary (writes) ───┼─ Read Replica 2 (us-east)
                    ├─ Read Replica 3 (us-west)
                    │  ... (~50 total)
                    └─ Read Replica N (eu-west)

~50 read replicas distributed across multiple geographic regions let read queries route to the nearest replica for lower latency. The application layer decides based on operation type whether to hit the primary (strongly consistent reads and writes) or a read replica (reads tolerant of slight staleness).

Core Concepts

PgBouncer Connection Pooling

PostgreSQL’s connection model is one process per connection, with ~50ms overhead to establish a connection. For a system handling millions of queries per second, this overhead accumulates into a significant bottleneck.

OpenAI’s solution: add a PgBouncer deployment in front of each read replica:

Application pods
      ↓ (few long-lived connections)
PgBouncer Kubernetes Deployment
  (transaction pooling mode)
      ↓ (connection pool)
PostgreSQL Read Replica

Transaction pooling mode: Once each query or transaction completes, the underlying connection immediately returns to the pool without waiting for the client to disconnect. This lets a small number of underlying connections serve a large number of concurrent application requests.

Result: Connection establishment time drops from 50ms to 5ms.

Deployment architecture: Each read replica has an independent Kubernetes Service backed by multiple PgBouncer pods, using K8s Service load balancing to distribute traffic.

Cascading Replication

In standard PostgreSQL primary-replica replication, the primary streams WAL (Write-Ahead Log) to all replicas. 50 replicas means the primary maintains 50 WAL stream connections — significant pressure on the primary’s network and CPU.

OpenAI is testing cascading replication with the Azure PostgreSQL team:

Primary → Intermediate Replica 1 → Downstream Replicas 1a, 1b, 1c
        → Intermediate Replica 2 → Downstream Replicas 2a, 2b, 2c

The primary only needs to maintain a few connections to intermediate replicas; those intermediate replicas forward WAL to downstream replicas. This lets the replica count scale from 50 to 100+ without increasing the primary’s replication pressure.

Managing Write Pressure

Write bottlenecks can’t be solved with read replicas. OpenAI’s approach:

Migrate high-write subsystems: Move write-heavy workloads (usage metrics, logs, session state) to distributed systems suited for high writes, letting PostgreSQL focus only on the strongly-consistent business data it excels at.
Rate-limit bulk updates: Set write rate limits on background batch updates and backfills to prevent sudden large write spikes from hitting the primary.
Strict operational discipline: Nobody runs large batch updates on the primary without first validating the primary impact in a test environment.

Comparison with Common Alternatives

Approach	Best For	Why OpenAI Didn’t Choose
Read replicas	Read-heavy, write-light	✅ OpenAI’s choice
Sharding	Write bottlenecks, massive data volume	Engineering cost too high; ChatGPT is primarily reads
NewSQL (Spanner/CockroachDB)	Global distributed strong consistency	High migration cost; existing PG architecture sufficient
NoSQL (DynamoDB/MongoDB)	Key-value or document queries	Needs relational queries; switching database cost too high

Bottom Line

The biggest lesson from OpenAI’s PostgreSQL architecture isn’t the technical details — it’s the prioritization of engineering decisions:

First understand your workload characteristics (read-heavy or write-heavy? Strong consistency requirements?)
Use minimal engineering cost to solve the biggest bottleneck (connection pooling fixed the latency problem; read replicas fixed read scaling)
Defer high-engineering-cost decisions until they’re truly necessary (sharding stays on the shelf until read replicas can’t hold)

For most systems, “replace PostgreSQL” is the last option, not the first. Make it work well first.

References

← Previous MCP in Claude Code: How Model Context Protocol Connects AI to Your Tool Ecosystem

Next → What Is Inner Safety? How Therapists Help Us Rebuild It

System Design Mock: Architecture Decisions for a Book E-Commerce Platform

For a book selling platform, the key decisions are search architecture (Elasticsearch vs full-text search), inventory consistency (strong vs eventual), and order state machine design.

#system-design #microservices #e-commerce #database #caching #api-design

tech

June 2, 2026

After a 1,000,000x AI Compute Leap: What Jeff Dean Sees Next

Jeff Dean breaks down where the million-fold AI compute gains actually came from — specialized hardware, distributed training systems, and architecture efficiency — and where the next phase is headed.

#ai #computing #jeff-dean #google #scaling #hardware #tpu