Service Disruptions and Latencies in our EU PoP

Incident Report for PubNub

Postmortem

Incident date and time: May 4th 2018 2:18am Pacific Time

Affected Services: EU-Central Point of Presence - all services

*Problem Description, Impact and Resolution:*

Problem: During normal operations, a network connectivity issue temporarily affected access to an Availability Zone (AZ) in our EU-Central PoP. This caused all traffic to fail over to the other AZ in the region, where a rapidly increasing surge of traffic overwhelmed the infrastructure. Devices connected via this PoP would have experienced service issues, including timeouts and latencies. Traffic began to be routed to a US-East PoP during this period. However, the traffic load continued to increase quickly, adding an unexpected surge of load to both the US-East and EU-Central PoPs. EU traffic was routed back to the EU PoPs while the team worked to access logs on the heavily loaded machines in order to identify the cause of the traffic and begin troubleshooting. Once we identified the issue, we found that we were receiving an inordinate number of new connections to one specific PubNub API. The traffic was coming from a large population of endpoints, but with a signature representing a specific type of device. Blocking the massive surge in traffic from these devices to that API required deploying a new layer of load balancers and structuring a compound ruleset to identify and block the traffic surge. Once the bad traffic had been segregated from the good traffic, connectivity was restored for all devices.

We identified that the surge in traffic from these devices was triggered by the intermittent loss of connectivity to the PoP. This caused the devices to follow a code path with faulty retry logic that spawned multiple threads per device, each thread repeatedly hitting a specific PubNub API. This fast-growing traffic generated an overwhelming number of new connections from each endpoint to our PoP, leading to continued service disruptions. We believe this event was triggered by an AWS network issue (we are awaiting feedback from AWS as to what happened) and was further compounded by a bug that caused an excessive number of retry calls to be made to the PubNub service.

Summary Timeline (Pacific Time)

May 4th
2:18 am - Failures start in EU-Central
2:22 am - First page from monitoring
2:50 am - Status Page updated
3:41 am - Partial traffic redirected to Virginia
3:42 am to 5:48 am - Various mitigation strategies put in place
4:20 am - Traffic directed back from Virginia to EU
5:48 am - Mitigation strategy put in place to filter harmful traffic with load balancer. This restored service, but PubNub APIs were available on different IPs than those previously used, thus potentially affecting a small number of customers who connect only via whitelisted IPs.

May 4th/May 5th
5:49 am (5/4) to 2:26 am (5/5) - We worked on a number of different solutions to ensure access also became available on previously used IPs that some customers may have whitelisted. We were also working with the particular customer to address code issues and scale our infrastructure to support their filtered traffic.

May 5th
2:26 am - Made additional load balancer configuration modifications for EU pubnub.com
9:13 am - After monitoring for 5 hours we resolved the incident.

Root cause: A network connection issue (AZ connectivity issues) in EU-Central triggered a code path in one customer’s connected devices that caused a network connection flood peaking at or above 200 million new connections per minute. During this time our infrastructure could not scale quickly enough to handle the load. We tried to mitigate this connection onslaught by making configuration changes and changing infrastructure to cope with it. Initial attempts made slight improvements, but the connection rates kept increasing.

We were finally able to put a solution in place that could scale out at the rate we needed, and this allowed us to route the traffic for the ps-nnn.pubnub.com domain away from the main datacenter. It took some time for the load balancer solution to scale enough to accept all the connections. During this time customers would have experienced issues (latencies, timeouts or refused connections). Once in place, this allowed the service to recover and restore normal operations and latencies. This would have put 99% of customers in a working state. Customers who restrict connectivity to previously known “whitelisted IPs” at our edge would still have been experiencing issues.

We further changed our DNS infrastructure to reroute certain domains (ps-nnnn.pubnub.com) away from the main network so that we could revert all non ps-nnnn.pubnub.com traffic back to our prior network load balancers. This restored connectivity to any customer reliant on whitelisted IPs, as well as the rest of our customer base. At this point all customers would have been fully operational.

The volume of traffic into our network was created by errant network connection retry logic. Once we identified the faulty code, we were able to create synthetic response servers that would satisfy the SDK retry logic and prevent further new connections into our network. This put those clients in a ‘normal’ state rather than a connection retry state, which reduced the remaining load on our network.

*Mitigation Steps and Recommended Future Preventative Measures:*

We worked closely with the customer whose devices caused the traffic flood on new connectivity logic and a subsequent firmware update.

Audit retry logic on all client SDKs to ensure there is back-off retry logic as appropriate. We have performed an extensive audit of our SDKs (completed).
Address any issues with connection retry logic in supported SDKs. Move all supported SDKs to exponential backoff with randomization, with a plan to notify any large-deployment customers using SDKs without backoff logic.

We expect this work to be completed by: June 30
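
As an illustration of the exponential backoff with randomization mentioned above, here is a minimal sketch in Python. It is not taken from any PubNub SDK: the function names (`backoff_delays`, `connect_with_retry`), the default values for `base`, `cap` and `max_attempts`, and the choice of the "full jitter" variant are all assumptions made for this example. The point it shows is that each device retries from a single loop, waits longer after each failure, and randomizes the wait so that a large fleet does not reconnect in lockstep after a brief network interruption.

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, max_attempts=8, rng=random.random):
    """Yield one sleep duration (in seconds) per retry attempt.

    Exponential backoff with "full jitter": the ceiling doubles on each
    attempt up to `cap`, and the actual delay is drawn uniformly from
    [0, ceiling) so that many clients do not retry at the same instant.
    All defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def connect_with_retry(connect, sleep=time.sleep, **backoff_kwargs):
    """Call `connect()` until it succeeds or the retry budget is spent.

    One caller, one loop: a failure never spawns extra threads, so the
    number of in-flight connection attempts per device stays at one.
    """
    last_error = None
    for delay in backoff_delays(**backoff_kwargs):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait before the next attempt
    raise ConnectionError("retry budget exhausted") from last_error
```

With these assumed defaults, the upper bound on the wait grows as 1, 2, 4, 8, 16, 32, 60, 60 seconds, and the caller gives up after eight failed attempts instead of retrying indefinitely.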

Where appropriate, give customers custom domains e.g. .pubnub.com
Custom origin domain model will give PubNub another mechanism for routing/prioritizing traffic during incidents.

We expect this work to be completed by: June 30

Posted May 10, 2018 - 15:31 UTC

Resolved

We have been monitoring this issue for some time and have been seeing normal operations and latencies. We are closing this incident.

Posted May 05, 2018 - 17:13 UTC

Update

We are seeing some errors in our Frankfurt PoP as a side effect of some reconfiguration we are doing. We expect these to clear quickly and service to return to normal.

Posted May 05, 2018 - 05:25 UTC

Update

The current status of the network is nominal. Our team has made progress on adjustments to our network, and we have been successful in avoiding additional network disruptions. We still have more work to do. We are continuing work on the EU region to re-merge traffic from our previous fork. It is too soon to say what the root cause was. We can describe the effect of the cause as a cascade of retries after a brief network disturbance. We expect to complete our tasks before the end of Friday PDT.

Posted May 05, 2018 - 02:21 UTC

Update

We continue to see services running at normal operating latencies and tolerance. However, we are also making small adjustments to our network, during which you may see short interruptions or latencies. We hope to be through this soon and apologize for these interruptions.

Posted May 05, 2018 - 00:04 UTC

Update

We continue to see services running at normal operating latencies and tolerance. We are confident this issue is addressed. However we are leaving this incident open to monitor the situation a little longer to be conservative. We also are making small adjustments and improvements to assist customers who are reliant on static IPs on our edge. We expect this to be done shortly.

Posted May 04, 2018 - 20:32 UTC

Update

Services remain normal but we are continuing to monitor.

Posted May 04, 2018 - 16:58 UTC

Update

We are continuing to monitor for any further issues.

Posted May 04, 2018 - 16:12 UTC

Monitoring

The new mitigation approach has worked and services have been operational for some time now. We will continue to monitor and will update status. Customers with whitelisted IPs may still be experiencing problems.

Posted May 04, 2018 - 16:07 UTC

Update

All of the leading indicators we have show that latencies and errors are returning to normal. We are still actively monitoring and will update status.

Posted May 04, 2018 - 15:15 UTC

Update

The new mitigation approach has shown continued success. We are repairing downstream server health problems caused by the flood, and expect continued improvement.

Posted May 04, 2018 - 14:40 UTC

Update

We have deployed a new filter to block the network traffic causing the outages. Early indications are showing some success at mitigating the flood and we are continuing to work towards a resolution.

Posted May 04, 2018 - 14:09 UTC

Update

Unfortunately, the mitigation we applied only worked temporarily. We are still seeing unusually massive traffic patterns, which continue to cause latencies and errors. We are trying other mitigation strategies and will continue to provide status.

Posted May 04, 2018 - 13:31 UTC

Identified

We have put mitigations in place and are beginning to see timeouts and latencies decrease. As we route more EU traffic through the mitigation, we expect to see continued improvement.

Posted May 04, 2018 - 12:43 UTC

Update

We are experiencing abnormal traffic patterns which are causing latencies. We are working to improve latencies and are rerouting traffic where possible.

Posted May 04, 2018 - 12:04 UTC

Update

We are continuing to investigate this issue.

Posted May 04, 2018 - 10:52 UTC

Update

We are still looking into the issue. We are currently failing over the region and hope that this will restore service.

Posted May 04, 2018 - 10:50 UTC

Investigating

We're investigating high latency and timeout issues in our EU PoP.

Posted May 04, 2018 - 09:46 UTC

This incident affected: Realtime Network (Publish/Subscribe Service, Storage and Playback Service, Stream Controller Service, Presence Service, Access Manager Service).