Increased errors and latency in US East

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution 

On May 22, 2025, at 20:51 UTC, PubNub experienced increased errors and latency for a subset of traffic within a single availability zone in the US East region.

The root cause was an operational error during an infrastructure change. We unintentionally drained traffic from newly deployed load balancers. As a result, the system routed traffic incorrectly, causing elevated error rates and latency for some users.

Once the issue was identified, traffic was rerouted, and service performance returned to normal levels. The incident was resolved by 21:04 UTC the same day.

Mitigation Steps and Recommended Future Preventative Measures 

To prevent a recurrence, the change procedure has been updated to reference isolated, non-production resources during preparation stages. If the script is run prematurely, it will now operate on an empty set, so no production traffic is impacted.
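
The safeguard described above can be illustrated with a minimal sketch. It is not PubNub's actual tooling: the group names, the `INVENTORY` mapping, and the `drain` function are all hypothetical, and the real drain call is replaced by a simple list append. The sketch only shows the idea that a script pointed at an isolated preparation group finds no members and therefore changes nothing.

```python
from typing import Dict, List

# Hypothetical inventory: target-group name -> load balancer names.
# During preparation, the procedure references the isolated group,
# which is intentionally empty.
INVENTORY: Dict[str, List[str]] = {
    "prod-us-east": ["lb-a", "lb-b"],  # assumed production group
    "prep-isolated": [],               # assumed preparation placeholder
}


def drain(group: str, inventory: Dict[str, List[str]]) -> List[str]:
    """Return the load balancers that would be drained for `group`.

    Raises KeyError for an unknown group so that a typo cannot be
    mistaken for an intentionally empty set.
    """
    targets = inventory[group]
    drained: List[str] = []
    for lb in targets:
        # A real script would call the provider's drain API here.
        drained.append(lb)
    return drained


# Run prematurely against the preparation group: nothing is drained.
premature_result = drain("prep-isolated", INVENTORY)
```

In this sketch, only an explicit switch to the production group name at execution time would produce a non-empty target list.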

Posted May 28, 2025 - 22:18 UTC

Resolved

Beginning at 20:51 UTC, we detected increased errors and latency for traffic within a single availability zone in the US East region. Our engineers investigated the issue and successfully restored service, which has remained stable since 21:04 UTC.
Posted May 22, 2025 - 21:34 UTC
This incident affected: Realtime Network (Publish/Subscribe Service, Storage and Playback Service, Stream Controller Service, Presence Service, Access Manager Service, Realtime Analytics Service, Mobile Push Gateway) and Points of Presence (North America Points of Presence).