The app for independent voices

๐—›๐—ผ๐˜„ ๐—ข๐—ฝ๐—ฒ๐—ป๐—”๐—œ ๐˜€๐—ฐ๐—ฎ๐—น๐—ฒ๐—ฑ ๐—ฃ๐—ผ๐˜€๐˜๐—ด๐—ฟ๐—ฒ๐—ฆ๐—ค๐—Ÿ ๐˜๐—ผ ๐Ÿด๐Ÿฌ๐Ÿฌ ๐—บ๐—ถ๐—น๐—น๐—ถ๐—ผ๐—ป ๐˜‚๐˜€๐—ฒ๐—ฟ๐˜€ ๐˜„๐—ถ๐˜๐—ต ๐—ฎ ๐˜€๐—ถ๐—ป๐—ด๐—น๐—ฒ ๐—ถ๐—ป๐˜€๐˜๐—ฎ๐—ป๐—ฐ๐—ฒ?

If you work with databases at scale, this might surprise you. OpenAI serves 800 million active ChatGPT users from a single PostgreSQL primary instance. No sharding, no exotic database technology. Just one writer and about 50 read replicas handle over a million queries per second.

Here is how they did it and what problems they had.

๐—ง๐—ต๐—ฒ ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ

OpenAI uses Azure Database for PostgreSQL with a simple setup: one primary instance handles all writes, while nearly 50 read replicas handle read traffic globally. PgBouncer manages connection pooling. The response times improved from 50ms to under 5ms.

So the basic principle was to keep the primary at all costs.

๐Ÿญ. ๐—ฅ๐—ฒ๐—ฎ๐—ฑ ๐—ง๐—ฟ๐—ฎ๐—ณ๐—ณ๐—ถ๐—ฐ. They offload read traffic to replicas whenever possible. To prevent high-priority requests from being interfered with by low-priority queries, they created dedicated replicas for high-priority requests. Some reads must remain on the primary since they are part of write transactions.

๐Ÿฎ. ๐—ช๐—ฟ๐—ถ๐˜๐—ฒ ๐—ง๐—ฟ๐—ฎ๐—ณ๐—ณ๐—ถ๐—ฐ. For write traffic, shardable write-heavy workloads were moved to a sharded system, like Azure CosmosDB. Application optimization code did not write when it did not have to write.

Creating new tables in PostgreSQL is not permitted, and new workloads should be put in a sharded system.

๐—ง๐—ต๐—ฒ ๐—ฃ๐—ผ๐˜€๐˜๐—ด๐—ฟ๐—ฒ๐—ฆ๐—ค๐—Ÿ ๐— ๐—ฉ๐—–๐—– ๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ

Andy Pavlo from CMU calls PostgreSQL's MVCC implementation "the worst among major relational databases." Here is why it creates challenges at scale:

๐Ÿฏ. ๐—ช๐—ฟ๐—ถ๐˜๐—ฒ ๐—”๐—บ๐—ฝ๐—น๐—ถ๐—ณ๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป. When you update one column, PostgreSQL copies the entire row. Update a single field in a table with 1000 columns, and you create a new version with 999 untouched columns. MySQL and Oracle store compact deltas instead.

๐Ÿฐ. ๐—œ๐—ป๐—ฑ๐—ฒ๐˜… ๐—•๐—น๐—ผ๐—ฎ๐˜. Every row version can create index entries. Five versions of a single row could mean five index entries. This bloats indexes, consumes memory, and slows queries.

๐Ÿฑ. ๐——๐—ฒ๐—ฎ๐—ฑ ๐—ง๐˜‚๐—ฝ๐—น๐—ฒ๐˜€. Old row versions hang around indefinitely until removed by autovacuum. When writes are heavy, dead rows outpace cleaning of live rows. A table that contains 10 million live rows might therefore contain 40 million dead rows.

๐Ÿฒ. ๐—”๐˜‚๐˜๐—ผ๐˜ƒ๐—ฎ๐—ฐ๐˜‚๐˜‚๐—บ ๐—•๐—น๐—ผ๐—ฐ๐—ธ๐˜€. Long-running transactions block autovacuum. More dead tuples lead to stale statistics, slower queries, and longer-running transactions. A vicious cycle.

๐—›๐—ผ๐˜„ ๐—ข๐—ฝ๐—ฒ๐—ป๐—”๐—œ ๐—ฆ๐—ผ๐—น๐˜ƒ๐—ฒ๐—ฑ ๐—ง๐—ต๐—ฒ๐˜€๐—ฒ

To address this overhead issue, they introduced PgBouncer for connection pooling. This resolved the issue in an impressive manner. They also evaluated the queries generated by ORMs for executing queries on the database. Any complicated query generated by the ORMs had to be converted into raw queries.

In addition, they configured idle_in_transaction_session_timeout to prevent connections from blocking autovacuum.

For schema changes, they impose strict constraints:

- No tables are added

- Column addition has to be performed within 5 seconds

- No table rewriting is allowed

- Indexes have to be changed CONCURRENTLY

๐—ง๐—ต๐—ฒ ๐—ง๐—ฟ๐—ฎ๐—ฑ๐—ฒ-๐—ผ๐—ณ๐—ณ

OpenAI's PostgreSQL cannot grow organically anymore. Any new functionality that requires additional database tables must use Cosmos DB. Even sharding their current implementation will take months or years, and will require changes to hundreds of their application endpoints.

Read more about it in the text in the comments.

Image: OpenAI.

Read more:openai.com/index/scalinโ€ฆ

Jan 26
at
7:35 AM
Relevant people

Log in or sign up

Join the most interesting and insightful discussions.