Elevated latencies and errors for multiple services in US-West and US-East

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution 

On October 20th, 2025 at 07:06 UTC, our monitoring systems alerted us to elevated error levels across multiple PubNub services in the IAD region (US-East). Some customers may have experienced increased error rates and latency, as well as intermittent issues with Presence service availability across IAD (US-East), SJC (US-West), and HND (AP-Northeast).

We quickly determined the issue was caused by a broader infrastructure outage affecting our cloud provider (AWS) in the IAD region. We initiated regional failover procedures and re-routed new connections to alternate regions. However, because some of our failover runbooks had undocumented steps, and because the provider outage also delayed access to some of our internal tooling, existing connections for some services remained degraded for longer than expected.
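As noted above, re-routing new connections does not help clients holding long-lived connections into a degraded region: they stay pinned until the connection is reset and the hostname is resolved again. The sketch below is a generic illustration of that behavior, not part of any PubNub SDK; the `connect` callable and its parameters are assumptions for the example. Each retry re-invokes `connect()`, which would re-resolve DNS and so pick up a failover target.

```python
import random
import time


def reconnect_with_backoff(connect, max_attempts=5, base_delay=1.0):
    """Retry `connect()` with jittered exponential backoff.

    Because each attempt calls `connect` afresh, a client dropped
    from a degraded region re-resolves the service hostname and can
    land on the failover target.  Illustrative only: `connect` and
    these parameters are not part of any PubNub SDK.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; surface the failure
            # Exponential backoff with jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
```

A usage pattern would wrap the SDK's connection setup in `connect` so that transient regional failures are retried rather than surfaced immediately.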

To restore full service, we manually reset established connections, re-routed Presence traffic to Frankfurt (EU-Central), and brought additional infrastructure online in other regions to absorb the redirected traffic. Errors were mitigated by 09:20 UTC. Later in the day, additional regional load in US-West triggered a new wave of service degradation. We responded by isolating the US-East region again and scaling up balancer capacity in US-West. PubNub services were stabilized by 13:20 UTC and remained in a monitoring state while our infrastructure provider worked to fully resolve the underlying issue.
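The regional failover described above amounts to steering traffic to the first healthy region in a preference order. A minimal sketch of that selection logic follows; the region names, endpoints, and failover order are hypothetical, and this is not PubNub's actual failover implementation.

```python
# Hypothetical region identifiers, ordered by failover preference.
# These names are illustrative only.
FAILOVER_ORDER = ["us-east-iad", "us-west-sjc", "eu-central-fra"]


def pick_healthy_region(is_healthy, order=FAILOVER_ORDER):
    """Return the first region in the preference order that passes
    its health check.

    `is_healthy` is a callable taking a region id and returning a
    bool (in practice this would be a probe against the region's
    health endpoint).  If no region is healthy, fall back to the
    last region rather than failing outright.
    """
    for region in order:
        if is_healthy(region):
            return region
    return order[-1]
```

In a real deployment this decision is typically made at the DNS or load-balancer layer (e.g. health-checked failover records), so new connections land on the healthy region automatically.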

By 22:35 UTC, our provider reported full restoration of service. After validating stability in US-East, we completed rebalancing traffic by 23:48 UTC, and declared the incident resolved.

Mitigation Steps and Recommended Future Preventative Measures 

While this incident was caused by an external infrastructure outage, we’ve identified several opportunities to strengthen our internal readiness and response procedures.

We are consolidating and centralizing our regional failover procedures to ensure they are immediately accessible and complete for all production services. Any gaps in our process documentation for newer services will be addressed to ensure readiness before those services are fully adopted into production. Additionally, we are reviewing and resolving issues with internal tooling, including inventory and DNS resolution problems, which made mitigation more difficult during the incident.

These improvements will ensure faster and more consistent responses to future infrastructure-level disruptions, and reduce potential impact on customer traffic across regions.

Posted Oct 27, 2025 - 18:33 UTC

Resolved

With no further issues observed for the past 30 minutes, the incident has been resolved. We will follow up soon with a root cause analysis. If you believe you experienced an impact related to this incident, please report it to PubNub Support at support@pubnub.com.
Posted Oct 20, 2025 - 09:19 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 20, 2025 - 08:45 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Oct 20, 2025 - 07:58 UTC

Investigating

At approximately 07:02 UTC, PubNub services began experiencing elevated latencies and server errors in the US-West and US-East regions. PubNub technical staff are currently investigating, and further updates will follow as they become available.
Posted Oct 20, 2025 - 07:32 UTC
This incident affected: Realtime Network (Publish/Subscribe Service, Storage and Playback Service, Stream Controller Service, Presence Service, Access Manager Service, DNS Service, Mobile Push Gateway, App Context Service) and Points of Presence (North America Points of Presence).