Incident date and time: May 4th, 2018, 2:18 am Pacific Time
Affected Services: EU-Central Point of Presence - all services
*Problem Description, Impact and Resolution:*
Problem: During the course of normal operations, a network connectivity issue temporarily affected access to an Availability Zone in our EU-Central PoP. This caused all traffic to fail over to the other AZ in the region; however, a quickly increasing surge of traffic overwhelmed the infrastructure there. Devices connected via this PoP would have experienced service issues including timeouts and elevated latencies. Traffic was then routed to a US-East PoP, but the traffic load continued to increase rapidly, adding an unexpected surge of load to both the US-East and EU-Central PoPs. EU traffic was routed back to EU PoPs while the team worked to access logs on the heavily loaded machines to identify the cause of the traffic and begin troubleshooting.

Once we were able to investigate, we found an inordinate number of new connections to one specific PubNub API. The traffic was coming from a large population of endpoints, but with a signature representing a specific type of device. Blocking the massive surge in traffic from these devices required deploying a new layer of load balancers and structuring a compound ruleset to identify and block the offending requests. Once the bad traffic had been segregated from the good traffic, connectivity was restored for all devices.
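A compound ruleset of this kind blocks a request only when every condition matches, so legitimate traffic to the same API is unaffected. The sketch below illustrates the idea; the header value and API path are invented for illustration, since the actual signature matched during the incident is not public.

```python
# Hypothetical sketch of a compound blocking rule. "AcmeDevice/" and
# "/v2/subscribe" are placeholders, not the real signature or endpoint.

def should_block(headers: dict, path: str) -> bool:
    """Block only when all conditions match: the flooded API path
    AND the user-agent signature of the misbehaving device fleet."""
    ua = headers.get("User-Agent", "")
    return (
        path.startswith("/v2/subscribe")   # the one API being flooded (assumed)
        and ua.startswith("AcmeDevice/")   # device firmware signature (invented)
    )

# Matching requests are rejected at the load balancer; all else passes.
print(should_block({"User-Agent": "AcmeDevice/1.0"}, "/v2/subscribe/demo"))  # True
print(should_block({"User-Agent": "Mozilla/5.0"}, "/v2/subscribe/demo"))     # False
```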
We identified that the surge in traffic from these devices was triggered by the intermittent loss of connectivity to the PoP. The loss of connectivity sent the devices down a code path with faulty retry logic that spawned multiple threads per device, each thread repeatedly hitting a specific PubNub API. This fast-growing traffic generated an overwhelming number of new connections from each endpoint to our PoP, leading to continued service disruptions. We believe this event was triggered by an AWS network issue (we are awaiting feedback from AWS as to what happened) and was further compounded by a bug that caused an excessive number of retry calls to be made to the PubNub service.
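The corrected pattern for client retry logic is a single retry loop with capped exponential backoff and jitter, rather than parallel retry threads hammering the API in lockstep. A minimal sketch (the function and parameter names are illustrative, not the customer's actual code):

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry a request from a single code path with capped exponential
    backoff plus full jitter, instead of spawning retry threads."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random time up to the capped backoff,
            # so a fleet of devices does not reconnect in lockstep.
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
```

With jitter, a fleet of devices that lost connectivity at the same moment spreads its reconnects over time instead of producing a synchronized connection flood.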
Summary Timeline (Pacific Time)
May 4th
- 2:18 am - Failures start in EU-Central
- 2:22 am - First page from monitoring
- 2:50 am - Status Page updated
- 3:41 am - Partial traffic redirected to Virginia
- 3:42 am to 5:48 am - Various mitigation strategies put in place
- 4:20 am - Traffic directed back from Virginia to EU
- 5:48 am - Mitigation strategy put in place to filter harmful traffic with a load balancer. This restored service, but PubNub APIs were available on different IPs than those previously used, potentially affecting a small number of customers who connect only via whitelisted IPs.

May 4th/May 5th
- 5:49 am (5/4) to 2:26 am (5/5) - We worked on a number of different solutions to ensure access also became available on the previously used IPs that some customers may have whitelisted. We also worked with the particular customer to address code issues and scaled our infrastructure to support their filtered traffic.

May 5th
- 2:26 am - Made additional load balancer configuration modifications for EU pubnub.com
- 9:13 am - After monitoring for 5 hours, we resolved the incident.
Root cause: A network connection issue (AZ connectivity issues) in EU-Central triggered a code path in one customer’s connected devices that caused a network connection flood peaking at or above 200 million new connections per minute. Our infrastructure could not scale quickly enough to handle this. We tried mitigating the connection onslaught with configuration and infrastructure changes; initial attempts made slight improvements, but the connection rates kept increasing. We were finally able to put a solution in place that could scale out at the rate we needed, which allowed us to route the traffic for the ps-nnn.pubnub.com domain away from the main datacenter. It took some time for the load balancer solution to scale enough to accept all the connections, and during this period customers would have experienced issues (latencies, timeouts, or refused connections). Once in place, this allowed the service to recover and restore normal operations and latencies, putting roughly 99% of customers in a working state.

Customers who restrict connectivity to previously known “whitelisted IPs” at our edge would still have been experiencing issues. We further changed our DNS infrastructure to reroute certain domains (ps-nnnn.pubnub.com) away from the main network so that we could revert all non ps-nnnn.pubnub.com traffic back to our prior network load balancers. This restored connectivity for customers reliant on whitelisted IPs as well as the rest of our customer base. At this point all customers would have been fully operational.

The volume of traffic into our network was created by errant network connection retry logic. Once we identified the faulty code, we were able to create synthetic response servers that would satisfy the SDK retry logic and prevent further new connections into our network. This put those clients in a ‘normal’ state rather than a connection-retry state, reducing the remaining load on our network.
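A synthetic response server of the kind described above answers every request with a minimal success response, so the clients' retry logic sees a "healthy" service and stops opening new connections. A minimal sketch, assuming an HTTP API; the payload shape is a placeholder, since the exact response the SDK expected is not public:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class SyntheticHandler(BaseHTTPRequestHandler):
    """Answer every GET with a minimal success response so that client
    retry logic stops reconnecting."""
    def do_GET(self):
        body = b'{"status": 200}'  # placeholder payload; the real response
                                   # shape expected by the SDK is not public
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the hot path quiet under flood-level request rates

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SyntheticHandler).serve_forever()
```

Pointing the flooding domain's DNS at servers like this absorbs the retry traffic cheaply while the main service recovers.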
*Mitigation Steps and Recommended Future Preventative Measures:*
We worked closely with the customer whose devices caused the traffic flood to develop new connectivity logic and a subsequent firmware update.
We expect this work to be completed by: June 30