On October 20, 2025, at 07:06 UTC, our monitoring systems alerted us to elevated error rates across multiple PubNub services in the IAD (US-East) region. Some customers may have experienced increased error rates and latency, as well as intermittent Presence service availability issues across IAD (US-East), SJC (US-West), and HND (AP-Northeast).
We quickly determined the issue was caused by a broader infrastructure outage affecting our cloud provider (AWS) in the IAD region. We initiated regional failover procedures and re-routed new connections to alternate regions. However, because some of our failover procedures had undefined steps and the provider outage delayed access to some of our internal tools, existing connections for some services remained degraded for longer than expected.
To restore full service, we manually reset established connections, re-routed Presence traffic to Frankfurt (EU-Central), and brought additional infrastructure online in other regions to absorb traffic. Errors were mitigated by 09:20 UTC. Later in the day, additional regional load in US-West triggered a new wave of service degradation. We responded by isolating the US-East region again and scaling up load balancer capacity in US-West. PubNub services were stabilized by 13:20 UTC, and we continued monitoring while our infrastructure provider worked to fully resolve the underlying issue.
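As a generic illustration of the failover pattern described above, the sketch below shows how new connections can be steered toward a healthy region when the primary is degraded. The endpoint names, health-check path, and timeout are hypothetical, and this is not PubNub's actual routing logic, which operates at the infrastructure layer rather than in client code.

```typescript
// Illustrative only: hypothetical regional endpoints and thresholds,
// not PubNub's actual failover implementation.
const REGIONAL_ENDPOINTS = [
  "https://iad.example-origin.com", // US-East (primary)
  "https://sjc.example-origin.com", // US-West
  "https://fra.example-origin.com", // EU-Central
];

// Try each region's health endpoint in order and return the first
// origin that responds successfully within the timeout.
async function pickHealthyOrigin(timeoutMs = 2000): Promise<string> {
  for (const origin of REGIONAL_ENDPOINTS) {
    try {
      const res = await fetch(`${origin}/healthz`, {
        signal: AbortSignal.timeout(timeoutMs),
      });
      if (res.ok) return origin;
    } catch {
      // Timeout or network error: treat the region as degraded and move on.
    }
  }
  throw new Error("No healthy regional origin available");
}

// New connections are established against whichever region is healthy,
// so a degraded primary region is bypassed automatically.
pickHealthyOrigin()
  .then((origin) => console.log(`Routing new connections to ${origin}`))
  .catch((err) => console.error(err));
```

The key property of this pattern, and of the production failover it loosely mirrors, is that only new connections are redirected; already-established connections must be drained or reset separately, which is why that step appears explicitly in the timeline above.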
By 22:35 UTC, our provider reported full restoration of service. After validating stability in US-East, we completed rebalancing traffic by 23:48 UTC, and declared the incident resolved.
While this incident was caused by an external infrastructure outage, we’ve identified several opportunities to strengthen our internal readiness and response procedures.
We are consolidating and centralizing our regional failover procedures so they are complete and immediately accessible for all production services. Gaps in the process documentation for newer services will be closed before those services are fully adopted into production. We are also reviewing and resolving issues with internal tooling, including the inventory and DNS resolution problems that made mitigation more difficult during this incident.
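To make the tooling point concrete, here is a minimal sketch of the kind of pre-flight check that can surface DNS resolution problems before a failover is attempted. The hostnames and structure are hypothetical and purely illustrative; they do not reflect our internal inventory or DNS tooling.

```typescript
// Illustrative readiness check, not our actual tooling: verifies that the
// regional hostnames referenced in a failover runbook resolve in DNS,
// so resolution problems surface before a failover is attempted.
import { promises as dns } from "node:dns";

// Hypothetical hostnames for illustration.
const RUNBOOK_HOSTS = [
  "iad.example-origin.com",
  "sjc.example-origin.com",
  "fra.example-origin.com",
];

async function checkDnsReadiness(hosts: string[]): Promise<void> {
  const results = await Promise.allSettled(
    hosts.map((host) => dns.resolve4(host)),
  );
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      console.log(`${hosts[i]} -> ${result.value.join(", ")}`);
    } else {
      console.error(`${hosts[i]} failed to resolve: ${result.reason}`);
    }
  });
}

checkDnsReadiness(RUNBOOK_HOSTS);
```

Running checks like this on a schedule, rather than only during an incident, is one way to ensure that failover runbooks remain executable when they are needed most.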
These improvements will ensure faster and more consistent responses to future infrastructure-level disruptions, and reduce potential impact on customer traffic across regions.