Spike in latency and errors in EU, US-East PoPs

Incident Report for PubNub

Postmortem

Incident date and time: June 14, 2018, intermittently between 00:21 and 01:36 UTC

Affected Services: PubNub Access Manager (resulting in difficulty reaching downstream services)

Problem Description, Impact and Resolution:

During the incident window, three events caused elevated latencies and errors. Two of the events were of the same type and occurred from 00:21 to 00:36 UTC and from 01:21 to 01:36 UTC in the US-East PoP. During these events, publish operations experienced elevated latencies. The cause was an increase in traffic from a narrow IP range that exposed a load distribution bug: these requests were improperly routed, which added latency. The events were resolved by rate limiting the increased requests until the routing algorithm could be corrected.
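As an illustration of the rate-limiting mitigation described above, here is a minimal sketch of a token-bucket limiter keyed by IP prefix, so that a burst from a narrow range is throttled without affecting other clients. It is not PubNub's implementation: the class name `PrefixRateLimiter`, the /24 prefix length, and the rate and burst values are all assumptions chosen for the example.

```python
import ipaddress
import time


class PrefixRateLimiter:
    """Token bucket per IP prefix (hypothetical sketch, not PubNub's code)."""

    def __init__(self, rate_per_sec, burst, prefix_len=24, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens refilled per second, per prefix
        self.burst = burst            # maximum tokens a prefix can accumulate
        self.prefix_len = prefix_len  # assumed grouping: IPv4 /24
        self.clock = clock            # injectable clock, useful for testing
        self._buckets = {}            # prefix -> (tokens, last_seen_time)

    def _key(self, ip):
        # Map an address to its enclosing network, e.g. 10.0.0.5 -> 10.0.0.0/24.
        return ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)

    def allow(self, ip):
        """Return True if a request from `ip` may proceed, False if throttled."""
        now = self.clock()
        key = self._key(ip)
        tokens, last = self._buckets.get(key, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[key] = (tokens - 1, now)
            return True
        self._buckets[key] = (tokens, now)
        return False
```

With `rate_per_sec=1` and `burst=2`, two back-to-back requests from the same /24 are allowed and a third is rejected until the bucket refills, while requests from other prefixes are unaffected.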

The third, independent event occurred from 01:13 to 01:29 UTC in an EU PoP. During this event, customers may have experienced high latency and error rates across all PubNub services. A failure in our service provider's network caused the PubNub Access Manager (PAM) service to have a high failure rate when authenticating client permissions. The incident was resolved by mitigating the failure with additional capacity and traffic shaping until our service provider restored the network link.

Mitigation Steps and Recommended Future Preventative Measures:

For the first type of incident, we are deploying new code to fix the traffic distribution bug, which we expect will prevent this problem from recurring. We also continue to improve our operations to provide as much flexibility and robustness as possible around traffic shaping and mitigation.

For the EU incident, we are exploring additional failover options to mitigate permission lookup failures. These will include an additional failover within the region for all authentication permissions.
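The general idea of a failover for permission lookups can be sketched as follows: try an ordered list of backends and fall through to the next when one is unreachable. This is only an illustration under stated assumptions. The function name `check_permission`, the use of callables as backends, and `ConnectionError` as the failure signal are hypothetical and do not describe how PAM is actually built.

```python
def check_permission(token, backends):
    """Return the answer from the first reachable permission backend.

    `backends` is an ordered list of callables (hypothetical), for example
    a primary lookup followed by an in-region failover. A backend signals
    that it is unreachable by raising ConnectionError.
    """
    last_error = None
    for lookup in backends:
        try:
            return lookup(token)
        except ConnectionError as exc:
            # This backend is unreachable; remember why and try the next one.
            last_error = exc
    # Every backend failed (or none were given): surface the last failure.
    raise RuntimeError("all permission backends failed") from last_error
```

A deliberate choice in this sketch is that only connectivity failures trigger failover. An explicit "denied" answer from a reachable backend is returned as is, so a failover never turns a denial into a grant.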

Posted Jun 15, 2018 - 01:26 UTC

Resolved

Service continues at normal parameters in all regions, so we have resolved this incident.

Posted Jun 14, 2018 - 02:19 UTC

Monitoring

We have identified the situation and taken corrective action, and are now monitoring to ensure the situation is resolved. Current indications show that latencies and errors have returned to normal.

Posted Jun 14, 2018 - 01:49 UTC

Update

We have discovered the cause, which was a degraded authentication service in an EU PoP, triggering increased errors and latencies for other PubNub services in EU and US-East for a period of up to 12 minutes. We are taking steps to resolve.

Posted Jun 14, 2018 - 01:45 UTC

Investigating

We have noticed an increase in errors and latency in many PubNub services. We are investigating the incident and will provide an update shortly.

Posted Jun 14, 2018 - 01:36 UTC

This incident affected: Realtime Network (Publish/Subscribe Service, Storage and Playback Service, Stream Controller Service, Presence Service, Access Manager Service, Realtime Analytics Service), Functions (Functions Service), and Points of Presence (North America Points of Presence, European Points of Presence).