Increased errors and latency for presence in FRA and BOM

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

On May 2, 2025 at 15:40 UTC, we observed elevated error rates and increased latency for customers using our Presence service in the Frankfurt (FRA) and Mumbai (BOM) regions. In response, we increased the memory allocation and the number of replicas for the affected services, and the issue was resolved at 16:00 UTC the same day.

This issue occurred because a sudden spike in subscribe traffic exhausted resources in key components of our Presence infrastructure, and we did not have adequate resource thresholds and alerting configured to scale proactively in response.

Mitigation Steps and Recommended Future Preventative Measures

To prevent a similar issue from occurring in the future, we have permanently increased resource limits and replica counts for the Presence service in the impacted regions. Over the next week, we will also improve our alerting and monitoring to detect abnormal traffic patterns earlier and to trigger automated scaling where possible.
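
As an illustration only, the kind of check described above (flagging an abnormal traffic spike and deriving a replica count from it) could be sketched as follows. This is not PubNub's actual implementation: the class and function names, the window size, the spike ratio, and the per-replica capacity are all hypothetical values chosen for the example.

```python
import math
from collections import deque
from statistics import mean


class SpikeDetector:
    """Flags a traffic sample that exceeds a multiple of the recent average.

    Hypothetical sketch: window size and ratio are illustrative assumptions.
    """

    def __init__(self, window: int = 5, ratio: float = 2.0) -> None:
        self.samples = deque(maxlen=window)  # most recent request-rate samples
        self.ratio = ratio                   # spike threshold relative to the average

    def observe(self, rate: float) -> bool:
        """Record a sample and return True if it looks like a spike."""
        # Only judge once the window is full, so the baseline is meaningful.
        is_spike = (
            len(self.samples) == self.samples.maxlen
            and rate > self.ratio * mean(self.samples)
        )
        self.samples.append(rate)
        return is_spike


def desired_replicas(current: int, rate: float,
                     per_replica_capacity: float, max_replicas: int) -> int:
    """Return a replica count sized for `rate`, capped at `max_replicas`.

    This sketch only ever scales up; it never returns fewer than `current`.
    """
    needed = math.ceil(rate / per_replica_capacity)
    return max(current, min(needed, max_replicas))
```

In this sketch, a detector built with `SpikeDetector(window=3, ratio=2.0)` stays quiet while subscribe traffic hovers around its baseline and returns True once a sample exceeds twice the rolling average. The replica count from `desired_replicas` could then be passed to whatever scaling mechanism is in use.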

Posted May 08, 2025 - 19:44 UTC

Resolved

Beginning at 15:40 UTC, we detected increased errors and latency for the Presence service in the BOM and FRA regions. Our engineers investigated the issue and were able to restore the service, which remains stable as of 15:52 UTC.

Posted May 02, 2025 - 15:23 UTC