Presence service errors in multiple regions

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution 

On July 31, 2025 at 22:15 UTC, we observed elevated 5xx errors across the Presence service in multiple regions. Customers may have experienced intermittent failures when attempting to receive or update presence messages.

We identified a subset of channels exhibiting highly concentrated activity patterns and applied a targeted configuration change to rebalance traffic across the cluster. The issue was resolved by 23:15 UTC on July 31, 2025.

This issue occurred because our infrastructure lacked proactive safeguards to evenly distribute presence traffic across nodes in scenarios where a small number of channels receive a disproportionately high number of presence updates. This resulted in resource saturation on some nodes without triggering early mitigation.

Mitigation Steps and Recommended Future Preventative Measures 

To prevent a similar issue from occurring in the future, we have applied a sharding configuration to affected channel patterns, which redistributes load more evenly across infrastructure components. This approach reduces the risk of overload caused by concentrated traffic.
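
As a rough illustration of the idea, and not a description of PubNub's actual implementation, the sketch below shows how sharding a hot channel pattern can spread its presence traffic over several nodes. Every name here (NODE_COUNT, SHARDED_PATTERNS, node_for, the "lobby-*" pattern) is hypothetical and chosen only for the example. It assumes that a channel normally maps to one node by hashing its name, and that a sharded channel is split into sub-keys selected by the client identifier.

```python
import hashlib
from fnmatch import fnmatchcase

NODE_COUNT = 8  # hypothetical cluster size

# Hypothetical sharding config: channel pattern -> number of shards.
SHARDED_PATTERNS = {"lobby-*": 4}


def _stable_hash(key: str) -> int:
    """Deterministic hash (unlike built-in hash(), which is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_count_for(channel: str) -> int:
    """Return the shard count for a channel, or 1 if no pattern matches."""
    for pattern, shards in SHARDED_PATTERNS.items():
        if fnmatchcase(channel, pattern):
            return shards
    return 1


def node_for(channel: str, client_id: str) -> int:
    """Pick the node that handles a presence update for this channel and client."""
    shards = shard_count_for(channel)
    if shards == 1:
        # Unsharded: all traffic for the channel lands on a single node.
        return _stable_hash(channel) % NODE_COUNT
    # Sharded: the client picks one of `shards` sub-keys, so the channel's
    # traffic is spread over at most `shards` nodes.
    shard = _stable_hash(client_id) % shards
    return _stable_hash(f"{channel}#{shard}") % NODE_COUNT
```

In this sketch an unsharded channel always resolves to the same node regardless of the client, which is what allows a single busy channel to saturate it. A channel matching a sharded pattern resolves to one of up to four nodes, at the cost of having to aggregate presence state across shards when it is read.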

In the coming days we will be:

  • Reviewing long-term suitability of the sharding configuration applied during this incident.
  • Investigating automation options for dynamically applying sharding logic based on real-time usage patterns.
  • Enhancing internal tooling and monitoring to better detect and respond to load imbalance scenarios before they cause service degradation.
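
One way such load-imbalance detection could work is sketched below, purely as an assumption for illustration: count presence updates per channel over a recent window and flag any channel whose share of the total exceeds a threshold. The function name hot_channels and the threshold values are hypothetical and do not reflect PubNub's internal tooling.

```python
from collections import Counter
from typing import Iterable, List


def hot_channels(
    events: Iterable[str],
    share_threshold: float = 0.2,
    min_events: int = 100,
) -> List[str]:
    """Return channels whose share of presence updates in the window
    is at least `share_threshold`.

    `events` is an iterable of channel names, one per presence update.
    Windows with fewer than `min_events` updates are ignored, since
    shares computed from very small samples are noisy.
    """
    counts = Counter(events)
    total = sum(counts.values())
    if total < min_events:
        return []
    return sorted(ch for ch, n in counts.items() if n / total >= share_threshold)
```

A channel flagged this way would be a candidate for a sharding rule before its traffic saturates a node.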
Posted Aug 04, 2025 - 15:26 UTC

Resolved

With no further issues observed, the incident has been resolved. We will follow up soon with a root cause analysis.
If you believe you experienced an impact related to this incident, please report it to PubNub Support at support@pubnub.com.
Posted Jul 31, 2025 - 23:14 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 31, 2025 - 22:44 UTC

Identified

The issue has been identified and a fix is being implemented. We will provide updates here as progress is made.
Posted Jul 31, 2025 - 22:36 UTC

Investigating

We have detected elevated error levels with the Presence service in multiple regions. Our engineers are actively working to mitigate the issue and return service to normal levels. We will provide updates here.

If you believe you have been impacted by the issue, please report impact to support@pubnub.com.
Posted Jul 31, 2025 - 22:14 UTC
This incident affected: Points of Presence (North America Points of Presence, Asia Pacific Points of Presence) and Realtime Network (Presence Service).