At 14:43 UTC on July 22, 2025, we observed elevated latency and service degradation in our Presence system in the FRA region, impacting customers' ability to receive timely presence events. We implemented infrastructure changes to relieve pressure on the system, and the incident ended at 15:06 UTC on July 22, 2025. Unfortunately, before we could conclusively determine and address the root cause, the issue recurred from 15:33 to 15:59 UTC on July 24, 2025.
The failures started with an out-of-memory condition in some Presence service nodes. The memory condition was triggered by a caching system, operated by a third-party vendor, that the Presence service relies on heavily: it began responding slowly at first, and later timed out entirely. While nodes that run into critical issues are generally taken out of service and replaced automatically, this issue spread faster than new capacity could be brought online to mitigate it.
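To illustrate the failure mode (this is a simplified sketch with hypothetical names, not our production code): when a cache dependency slows down, callers that wait indefinitely accumulate in-flight requests and exhaust memory. Enforcing a strict deadline on each cache call and failing fast to a fallback keeps that pressure bounded.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Hypothetical stand-in for the vendor cache client, simulating a slow response.
def slow_cache_get(key):
    time.sleep(0.5)  # the vendor cache responding slowly
    return f"value-for-{key}"

_pool = ThreadPoolExecutor(max_workers=4)

def cache_get_with_deadline(key, timeout_s=0.1):
    """Fail fast instead of letting slow cache calls pile up in memory."""
    future = _pool.submit(slow_cache_get, key)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        future.cancel()  # best effort; the worker thread may still finish
        return None      # caller falls back (e.g., to origin or a default)
```

With a 0.1 s deadline against a 0.5 s cache, `cache_get_with_deadline` returns `None` quickly rather than blocking, so memory held per request stays bounded even while the cache is degraded.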
Compounding this, a downstream internal system was misconfigured in how it retried failed requests: it aggressively retried the calls that were failing because of the Presence system's issues, and the resulting load caused it to experience failures of its own.
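A hedged sketch of the general mitigation for this class of retry storm (illustrative only; names and parameters are hypothetical, not our configuration): bounding the number of attempts and spacing them with capped, jittered exponential backoff, so retries do not amplify load on an already-struggling dependency.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_s=0.1, cap_s=2.0):
    """Retry fn with capped exponential backoff and full jitter.

    Immediate, unbounded retries multiply traffic against a failing
    dependency; jitter spreads retries out so callers don't retry in
    lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # full jitter: sleep anywhere in [0, capped exponential window]
            time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
```

A caller wraps each request as `call_with_retries(lambda: client.get(key))`; after `max_attempts` failures the error propagates instead of retrying forever.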
On the 22nd, we took manual action to break the retry cycle, scale capacity, and allow the Presence service to recover. On the 24th, we identified and addressed the root cause at the caching layer.
In the short term, we have scaled out the Presence system and made changes at the caching layer to mitigate the issue in the vendor-provided cache. Longer term, we are working with the vendor to fix the underlying problem, while also exploring alternatives. We have also corrected the retry configuration to prevent this kind of retry storm from happening again, and added monitoring and alerting for this failure mode so we can identify similar issues more quickly.