๐๐ผ๐ ๐ข๐ฝ๐ฒ๐ป๐๐ ๐๐ฐ๐ฎ๐น๐ฒ๐ฑ ๐ฃ๐ผ๐๐๐ด๐ฟ๐ฒ๐ฆ๐ค๐ ๐๐ผ ๐ด๐ฌ๐ฌ ๐บ๐ถ๐น๐น๐ถ๐ผ๐ป ๐๐๐ฒ๐ฟ๐ ๐๐ถ๐๐ต ๐ฎ ๐๐ถ๐ป๐ด๐น๐ฒ ๐ถ๐ป๐๐๐ฎ๐ป๐ฐ๐ฒ?
If you work with databases at scale, this might surprise you. OpenAI serves 800 million active ChatGPT users from a single PostgreSQL primary instance. No sharding, no exotic database technology. Just one writer and about 50 read replicas handle over a million queries per second.
Here is how they did it and what problems they had.
๐ง๐ต๐ฒ ๐๐ฟ๐ฐ๐ต๐ถ๐๐ฒ๐ฐ๐๐๐ฟ๐ฒ
OpenAI uses Azure Database for PostgreSQL with a simple setup: one primary instance handles all writes, while nearly 50 read replicas handle read traffic globally. PgBouncer manages connection pooling. The response times improved from 50ms to under 5ms.
So the basic principle was to keep the primary at all costs.
๐ญ. ๐ฅ๐ฒ๐ฎ๐ฑ ๐ง๐ฟ๐ฎ๐ณ๐ณ๐ถ๐ฐ. They offload read traffic to replicas whenever possible. To prevent high-priority requests from being interfered with by low-priority queries, they created dedicated replicas for high-priority requests. Some reads must remain on the primary since they are part of write transactions.
๐ฎ. ๐ช๐ฟ๐ถ๐๐ฒ ๐ง๐ฟ๐ฎ๐ณ๐ณ๐ถ๐ฐ. For write traffic, shardable write-heavy workloads were moved to a sharded system, like Azure CosmosDB. Application optimization code did not write when it did not have to write.
Creating new tables in PostgreSQL is not permitted, and new workloads should be put in a sharded system.
๐ง๐ต๐ฒ ๐ฃ๐ผ๐๐๐ด๐ฟ๐ฒ๐ฆ๐ค๐ ๐ ๐ฉ๐๐ ๐ฃ๐ฟ๐ผ๐ฏ๐น๐ฒ๐บ
Andy Pavlo from CMU calls PostgreSQL's MVCC implementation "the worst among major relational databases." Here is why it creates challenges at scale:
๐ฏ. ๐ช๐ฟ๐ถ๐๐ฒ ๐๐บ๐ฝ๐น๐ถ๐ณ๐ถ๐ฐ๐ฎ๐๐ถ๐ผ๐ป. When you update one column, PostgreSQL copies the entire row. Update a single field in a table with 1000 columns, and you create a new version with 999 untouched columns. MySQL and Oracle store compact deltas instead.
๐ฐ. ๐๐ป๐ฑ๐ฒ๐
๐๐น๐ผ๐ฎ๐. Every row version can create index entries. Five versions of a single row could mean five index entries. This bloats indexes, consumes memory, and slows queries.
๐ฑ. ๐๐ฒ๐ฎ๐ฑ ๐ง๐๐ฝ๐น๐ฒ๐. Old row versions hang around indefinitely until removed by autovacuum. When writes are heavy, dead rows outpace cleaning of live rows. A table that contains 10 million live rows might therefore contain 40 million dead rows.
๐ฒ. ๐๐๐๐ผ๐๐ฎ๐ฐ๐๐๐บ ๐๐น๐ผ๐ฐ๐ธ๐. Long-running transactions block autovacuum. More dead tuples lead to stale statistics, slower queries, and longer-running transactions. A vicious cycle.
๐๐ผ๐ ๐ข๐ฝ๐ฒ๐ป๐๐ ๐ฆ๐ผ๐น๐๐ฒ๐ฑ ๐ง๐ต๐ฒ๐๐ฒ
To address this overhead issue, they introduced PgBouncer for connection pooling. This resolved the issue in an impressive manner. They also evaluated the queries generated by ORMs for executing queries on the database. Any complicated query generated by the ORMs had to be converted into raw queries.
In addition, they configured idle_in_transaction_session_timeout to prevent connections from blocking autovacuum.
For schema changes, they impose strict constraints:
- No tables are added
- Column addition has to be performed within 5 seconds
- No table rewriting is allowed
- Indexes have to be changed CONCURRENTLY
๐ง๐ต๐ฒ ๐ง๐ฟ๐ฎ๐ฑ๐ฒ-๐ผ๐ณ๐ณ
OpenAI's PostgreSQL cannot grow organically anymore. Any new functionality that requires additional database tables must use Cosmos DB. Even sharding their current implementation will take months or years, and will require changes to hundreds of their application endpoints.
Read more about it in the text in the comments.
Image: OpenAI.
Read more:openai.com/index/scalinโฆ