Intermittent errors and latency with some Functions executions, Storage, and Presence in our Southern Asia PoP
Incident Report for PubNub
Postmortem

Problem Description, Impact, and Resolution

There were intermittent errors and increased latencies for Functions executions (XHR, Vault, and PubNub library calls), as well as minor impact to Presence, Storage, and Mobile Push webhook calls. The impact was isolated to our Southern Asia PoP.

Mitigation Steps and Recommended Future Preventative Measures

The root cause was a large spike in outbound request errors, which put more stress on our system than the volume of traffic it was processing would normally cause. We scaled to meet demand and improved our monitoring and alerting around this edge case to minimize the effects of any recurrence. We are also working to make the system more resilient to such error spikes so that this kind of impact is prevented entirely.

Posted Jun 26, 2020 - 19:40 UTC

Resolved
After a significant period of monitoring, we are confident that the issue has been resolved.
Posted Jun 24, 2020 - 00:35 UTC
Monitoring
A fix has been applied and we are monitoring all processes before resolving this incident.
Posted Jun 23, 2020 - 20:33 UTC
Identified
Starting at 16:15 UTC, some users in our Southern Asia PoP may be experiencing intermittent errors and increased latencies for Functions executions using XHR, Vault, or PubNub libraries, as well as for Presence, Storage, and Mobile Push webhook calls.
Posted Jun 23, 2020 - 20:10 UTC
This incident affected: Functions (Functions Service, Vault), Realtime Network (Storage and Playback Service, Presence Service), and Points of Presence (Asia Pacific Points of Presence).