On May 2, 2025 at 15:40 UTC, we observed elevated errors and increased latency for customers using our Presence service in the Frankfurt (FRA) and Mumbai (BOM) regions. We responded by increasing the memory allocation and the number of replicas for the affected services, and the issue was resolved at 16:00 UTC the same day.
This issue occurred because we did not have adequate resource thresholds and alerting in place to scale proactively in response to a sudden spike in subscribe traffic. As a result, key components of our Presence infrastructure exhausted their resources.
To prevent a similar issue in the future, we have permanently increased resource limits and replica counts for the Presence service in the impacted regions. Within the next week, we will also improve our alerting and monitoring to detect abnormal traffic patterns earlier and trigger automated scaling where possible.
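
As a rough illustration of what detecting an abnormal traffic pattern and triggering scaling can involve, the sketch below compares each new subscribe-rate sample against a rolling average and proposes a higher replica count when a sample exceeds a multiple of that average. It is a hypothetical example, not the Presence service's actual alerting or scaling logic: the names `SpikeDetector` and `desired_replicas`, the window size, the multiplier, and the replica limits are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean


class SpikeDetector:
    """Flags a sample that exceeds a multiple of the recent rolling average.

    Hypothetical helper: window and factor are illustrative defaults.
    """

    def __init__(self, window: int = 5, factor: float = 2.0) -> None:
        self.history = deque(maxlen=window)  # recent non-spike samples
        self.factor = factor

    def observe(self, subscribes_per_sec: float) -> bool:
        """Record one sample and return True if it looks like a spike."""
        # Judge only once a full window of baseline samples is available.
        is_spike = (
            len(self.history) == self.history.maxlen
            and subscribes_per_sec > self.factor * mean(self.history)
        )
        # Keep spike samples out of the baseline so it is not inflated.
        if not is_spike:
            self.history.append(subscribes_per_sec)
        return is_spike


def desired_replicas(current: int, is_spike: bool,
                     step: int = 2, max_replicas: int = 20) -> int:
    """Propose a replica count: add `step` on a spike, capped at `max_replicas`."""
    return min(current + step, max_replicas) if is_spike else current
```

In practice the output of a check like this would feed an alert or an autoscaler rather than change replica counts directly, and the thresholds would need tuning against real traffic in each region.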