Elevated History, Presence, and Push Latencies in Europe PoP
Incident Report for PubNub
Postmortem

Affected Date and Time: Thursday, April 18th, 7:00 - 19:45 UTC

Affected Services: Storage, Presence, Real Time Analytics

Affected Regions: European PoPs

Problem Description, Impact, and Resolution

During the affected period, customers in the European PoPs may have experienced delays of up to 3 seconds in the delivery of presence notifications, in persisted messages becoming available via a History API call, and in some mobile push notifications. Messages in other regions were unaffected. Publishes and subscribes in all regions were unaffected.

Routing a message involves several services that communicate with one another. During the incident, network connectivity issues affected connections to one of these services, and a number of connections failed. Because of a very rare race condition in the retry logic of one particular service, messages were retried multiple times, which delayed their processing by downstream services. This is the first time we have seen this behavior, and the service had been unchanged for some time.

Mitigation Steps and Recommended Future Preventative Measures

We are fixing the retry logic in this service so that future connection issues can no longer trigger this behavior. We are also improving our alerts and lowering our alerting thresholds so that we are notified sooner when message latency increases.
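
For readers who want a concrete picture of bounded retry logic, the following is a minimal sketch in Python. It is an illustration only and is not the code of the affected service, whose implementation and fix are not described in this report. The function name send_with_retry, its parameters, and the use of ConnectionError are all assumptions made for the example. The idea it shows is that each message gets a fixed maximum number of attempts, with a capped, jittered delay between attempts, so a failing connection cannot cause a message to be retried without limit.

```python
import random
import time


def send_with_retry(send, message, max_attempts=4, base_delay=0.05,
                    max_delay=1.0, sleep=time.sleep):
    """Call send(message), retrying on ConnectionError.

    Hypothetical helper for illustration: at most max_attempts calls are
    made, and the wait between calls grows exponentially up to max_delay.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message)
        except ConnectionError:
            # Give up after the final attempt so the caller sees the failure
            # instead of the message being retried indefinitely.
            if attempt == max_attempts:
                raise
            # Exponential backoff, capped at max_delay, with full jitter so
            # many clients do not retry at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

With the defaults above, the waits alone add at most 0.05 + 0.1 + 0.2 = 0.35 seconds before the error is surfaced to the caller. The sleep parameter exists so that tests can replace the real delay.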

Posted May 03, 2019 - 20:11 UTC

Resolved
Between 7:45 AM and 4:45 PM UTC on April 18th, we experienced increased message routing latency in our European PoPs. There were no delays for publish/subscribe round trips. However, in certain instances there were delays of up to 3 seconds in the delivery of presence notifications, in persisted messages becoming available via a History API call, and in some mobile push notifications. Messages in other regions were unaffected. Publishes and subscribes in all regions were unaffected.
Posted Apr 18, 2019 - 00:45 UTC