At 07:35 UTC on October 5, 2024, we received a report of intermittent failures (5xx errors) for History API requests. The issue was triggered by an unexpectedly high volume of data requests flowing through our shared infrastructure, which overwhelmed the shared history reader containers responsible for fetching this data from our storage nodes.
As the history reader containers retrieved and processed this data, they repeatedly exhausted their memory and were OOM-killed, even after their memory allocation had been significantly increased. Each time a container was killed, in-flight History API requests failed.
We isolated the requests responsible for the high data volume and deployed dedicated infrastructure to serve them. This resolved the issue at 00:43 UTC on October 6, and no further impact was observed across the broader customer base.
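For illustration, the sketch below shows one way such isolation can work: requests from the identified high-volume sources are routed to a dedicated reader pool instead of the shared one. The tenant identifiers, pool names, and routing function here are hypothetical and are not part of our actual configuration.

```python
# Hypothetical sketch: route requests from known high-volume sources to a
# dedicated history reader pool so they cannot starve the shared pool.

# Sources identified as responsible for the high data volume (illustrative IDs).
HIGH_VOLUME_SOURCES = {"tenant-a", "tenant-b"}

SHARED_POOL = "history-readers-shared"
DEDICATED_POOL = "history-readers-dedicated"


def select_reader_pool(source_id: str) -> str:
    """Return the reader pool that should serve a History API request."""
    if source_id in HIGH_VOLUME_SOURCES:
        return DEDICATED_POOL
    return SHARED_POOL


if __name__ == "__main__":
    for source in ("tenant-a", "tenant-z"):
        print(source, "->", select_reader_pool(source))
```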
To prevent recurrence, we now run dedicated infrastructure for high-volume data requests, and we implemented dynamic data bucket creation, which spreads large data volumes across more buckets and reduces strain on individual storage nodes. These changes allow the system to absorb sudden spikes in resource usage while maintaining stability for all customers.
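As a rough illustration of the dynamic bucketing idea, the sketch below scales the number of buckets with the size of the data set and assigns records to buckets by hashing their keys, so a large volume is spread across many smaller buckets rather than concentrated on a few nodes. The threshold, names, and hashing scheme shown are assumptions for illustration only, not our production implementation.

```python
# Hypothetical sketch of dynamic data bucket creation: the bucket count grows
# with the data volume so that no single bucket (and no single storage node)
# has to absorb an outsized share of a large request.
import hashlib

TARGET_BUCKET_SIZE = 10_000  # illustrative records-per-bucket target
MIN_BUCKETS = 1


def bucket_count_for(total_records: int) -> int:
    """Choose how many buckets to create for a data set of the given size."""
    return max(MIN_BUCKETS, -(-total_records // TARGET_BUCKET_SIZE))  # ceiling division


def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a record key to a bucket using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


if __name__ == "__main__":
    total = 250_000
    buckets = bucket_count_for(total)
    print(f"{total} records -> {buckets} buckets")
    print("record-42 goes to bucket", bucket_for("record-42", buckets))
```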