Affected Date and Time: Thursday, April 18th, 7:00 - 19:45 UTC
Affected Services: Storage, Presence, Real Time Analytics
Affected Regions: European PoPs
During the affected period, customers would have experienced delays of up to 3 seconds in the delivery of presence notifications and in persisted messages becoming available via History API calls, along with similar latency for some mobile push notifications. Messages in other regions were unaffected. Publishes and subscribes in all regions were unaffected.
Routing messages requires multiple services to communicate with one another. During the affected period, network connectivity issues disrupted connections to one of these services, and a number of connections failed. Because of a very rare race condition in the retry logic of one particular service, messages were retried multiple times, which delayed their processing by downstream services. This is the first time we have experienced this issue, and the service has been unchanged for some time.
We are working on fixing the retry logic in this service so that future connection issues no longer carry even a small chance of triggering repeated retries. We are also improving our alerts and lowering our alert thresholds so that we are notified sooner when message latency increases.
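For readers curious about what bounded retry logic can look like in general, the following is a minimal illustrative sketch, not the actual fix being deployed. It assumes a hypothetical `send` callable that raises `ConnectionError` on failure; the function name, parameters, and default values are all invented for this example. The idea is that each message gets a fixed maximum number of attempts, with a capped, randomized delay between them, so a connectivity problem cannot cause unbounded repeated retries.

```python
import random
import time


class RetriesExhausted(Exception):
    """Raised when every allowed attempt has failed."""


def send_with_bounded_retry(send, message, max_attempts=3,
                            base_delay=0.05, max_delay=1.0,
                            sleep=time.sleep):
    """Call send(message), retrying on ConnectionError at most max_attempts times.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return send(message)
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no attempts left, so do not sleep again
            # Exponential backoff with full jitter, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise RetriesExhausted(
        f"gave up after {max_attempts} attempts"
    ) from last_error
```

Because the attempt counter is local to a single call, a message cannot be re-queued for retry by more than one code path in this sketch, and the worst-case added delay per message is bounded by the sum of the capped backoff intervals.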