High Error Rates and Latencies in Europe PoP

Incident Report for PubNub

Postmortem

Incident Description, Impact, and Resolution:

Incident Description: Devices connected to a PubNub Europe Point-of-Presence (PoP) experienced high error rates, increased latency, and connectivity issues during the incident window. The incident was caused by a failure at the underlying data center service provider and was exacerbated by failures in PubNub’s regional failover processes.
Impact to Systems/Customers: During the incident window, devices connected to a PubNub Europe PoP were unable to connect to the PoP and/or experienced high error rates and increased latencies.
Incident Window: [16:22] -- [18:03] UTC.

Incident Timeline (UTC)

16:22: Incident began. Devices for some PubNub customers automatically failed over to other PoPs based on their account configurations. Those customers experienced minimal impact to service.
16:28: Initial investigation determined that the European PoP was unresponsive for some PubNub services, preventing most PubNub API calls from responding. Once it was determined that automated redundancy systems were not effective for all customers, manual steps to move the remaining impacted customers to other regions commenced at 16:42.
17:18: Various failover attempts were unsuccessful; the cause was ultimately identified as an error with the recently updated manual failover process. A further failover attempt succeeded in moving ~50% of the remaining impacted PoP traffic to other PoPs.
17:43: Traffic began to recover in the region. PubNub moved to a “Monitoring” stance, began a deeper investigation into the failover issues, and began soliciting confirmation from customers that issues were resolving in their deployed applications.
18:03: PubNub Incident Manager declared incident resolved.

Root Cause

The trigger for the incident was a catastrophic connection failure at the PoP’s underlying data center. The root cause of the outage was that PubNub Access Manager capacity was not distributed across many different availability zones, which would have provided better protection against intra-regional failures. A contributing factor was human error: the fallback manual failover process was not followed properly, which delayed a failover that would otherwise have commenced at 16:42 UTC.

Preventative Measures

PubNub will change how Access Manager capacity is distributed within each PoP to ensure more intra-region redundancy. Manual regional failover steps and training will be reviewed, simplified, and improved to reduce the chance of errors when manual processes are required. The automated failover approach will be reviewed and improved to encompass a broader set of edge cases and customer traffic patterns.

Posted Oct 03, 2018 - 23:57 UTC

Resolved

The incident is now completely resolved.

Posted Oct 01, 2018 - 18:03 UTC

Monitoring

All our services are back to a healthy state, and we are continuing to monitor them closely.

Posted Oct 01, 2018 - 17:43 UTC

Identified

We've identified the issue and are actively working towards resolution.

Posted Oct 01, 2018 - 17:08 UTC

Investigating

We are investigating increased errors in our EU data center for all services.

Posted Oct 01, 2018 - 16:39 UTC

This incident affected: Points of Presence (European Points of Presence).