The Thundering Herd Problem: Mitigation Strategies for Cache Stampedes
Introduction
Your Redis cache just saved you from a 500ms database query. The key expires. In the next 100 milliseconds, 10,000 requests arrive—all missing the cache, all hitting your database simultaneously. Your DB connections max out at 200. Query time jumps to 8 seconds. More requests pile up. The cascade begins.
This is the thundering herd problem, and it’s killed more production systems than most engineers realize. Let’s explore why cache expiration is dangerous and how to survive it at scale.