On August 1, 2025 at 22:15 UTC, we observed elevated 5xx errors across the Presence service in multiple regions. Customers may have experienced intermittent failures when attempting to receive or update presence messages.
We identified a subset of channels exhibiting highly concentrated activity patterns and applied a targeted configuration change to rebalance traffic across the cluster. The issue was resolved by 23:15 UTC on August 1, 2025.
This issue occurred because our infrastructure lacked proactive safeguards to evenly distribute presence traffic across nodes in scenarios where a small number of channels receive a disproportionately high number of presence updates. This resulted in resource saturation on some nodes without triggering early mitigation.
To prevent a similar issue from occurring in the future, we have applied a sharding configuration to affected channel patterns, which redistributes load more evenly across infrastructure components. This approach reduces the risk of overload caused by concentrated traffic.
In the coming days we will be: