At 07:35 UTC on October 5, 2024, we received a report of intermittent failures (5xx errors) for History API requests. The issue was triggered by an unexpectedly high volume of data requests flowing through our shared infrastructure, which overwhelmed the shared history reader containers responsible for fetching this data from our storage nodes.
As the history reader containers retrieved and processed this data, they repeatedly exhausted their memory and were OOM-killed, even after their memory allocation had been significantly increased. Each time a container was killed, in-flight History API requests failed.
We isolated the requests responsible for the high data volume and deployed dedicated infrastructure to serve them. This resolved the issue at 00:43 UTC on October 6, and no further impact was observed across the broader customer base.
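For illustration, the sketch below shows one way such isolation can work: requests from the identified high-volume sources are routed to a dedicated reader pool instead of the shared one. The tenant identifiers, pool names, and routing function here are hypothetical and are not part of our actual configuration.

```python
# Hypothetical sketch: route requests from known high-volume sources to a
# dedicated history reader pool so they cannot starve the shared pool.

# Sources identified as responsible for the high data volume (illustrative IDs).
HIGH_VOLUME_SOURCES = {"tenant-a", "tenant-b"}

SHARED_POOL = "history-readers-shared"
DEDICATED_POOL = "history-readers-dedicated"


def select_reader_pool(source_id: str) -> str:
    """Return the reader pool that should serve a History API request."""
    if source_id in HIGH_VOLUME_SOURCES:
        return DEDICATED_POOL
    return SHARED_POOL


if __name__ == "__main__":
    for source in ("tenant-a", "tenant-z"):
        print(source, "->", select_reader_pool(source))
```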
To prevent recurrence, we now run dedicated infrastructure for high-volume data requests, and we implemented dynamic data bucket creation, which spreads large data volumes across more buckets and reduces strain on individual storage nodes. These changes allow the system to absorb sudden spikes in resource usage while maintaining stability for all customers.
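As a rough illustration of the dynamic bucketing idea, the sketch below scales the number of buckets with the size of the data set and assigns records to buckets by hashing their keys, so a large volume is spread across many smaller buckets rather than concentrated on a few nodes. The threshold, names, and hashing scheme shown are assumptions for illustration only, not our production implementation.

```python
# Hypothetical sketch of dynamic data bucket creation: the bucket count grows
# with the data volume so that no single bucket (and no single storage node)
# has to absorb an outsized share of a large request.
import hashlib

TARGET_BUCKET_SIZE = 10_000  # illustrative records-per-bucket target
MIN_BUCKETS = 1


def bucket_count_for(total_records: int) -> int:
    """Choose how many buckets to create for a data set of the given size."""
    return max(MIN_BUCKETS, -(-total_records // TARGET_BUCKET_SIZE))  # ceiling division


def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a record key to a bucket using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


if __name__ == "__main__":
    total = 250_000
    buckets = bucket_count_for(total)
    print(f"{total} records -> {buckets} buckets")
    print("record-42 goes to bucket", bucket_for("record-42", buckets))
```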