Postmortem October 21st 2016

A service disruption to the PubNub Network began on October 21, 2016 at approximately 11:10 UTC and lasted intermittently until approximately 17:45 UTC. News agencies began reporting the “large internet outage” the same day, as it affected many popular network services and websites. The root cause was a now well-publicized, sophisticated multi-wave Distributed Denial of Service (DDoS) attack against the DNS provider Dyn.

Not all PubNub customers were affected, but many with devices on the East and West coasts of the United States experienced intermittent pubnub.com DNS resolution issues, which prevented clients from establishing connectivity to stream data. Intermittent latency spikes and connectivity problems continued globally through the day while Dyn mitigated the attack. DNS routes on downstream DNS providers were variably cached, lost, reset, or ignored by various ISPs, so the impact on specific devices or PubNub customers is hard to quantify.

PubNub employs Dyn DNS as its primary DNS provider specifically because of its superior fine-grained Geo-DNS capabilities. Geo-DNS is one component of a broader strategy PubNub uses for performant routing, distribution, and replication of traffic. At the time of PubNub’s launch, only Dyn had the Geo targeting and failover features available and operating at the scale required for PubNub (today, PubNub generates many tens of billions of DNS queries monthly). More recently, other DNS vendors have begun offering similar solutions. PubNub had also begun using Route53 as a secondary DNS provider for many DNS use-cases, including ‘catchall’ failover (if no Dyn geo-routing rule is matched, the lookup is resolved by Route53 using latency-based routing). However, when the Dyn attack occurred on Oct 21, PubNub was not set up to quickly shift all traffic to Route53.
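
To illustrate the ‘catchall’ behavior described above, here is a minimal sketch of the decision order only: match a geo-routing rule first, and fall through to a latency-based choice when no rule matches. It is a toy model, not PubNub's, Dyn's, or Route53's actual configuration. The region codes, hostnames, provider labels, and the `resolve` function are all hypothetical names chosen for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical geo-routing rules: region code -> edge hostname.
# These names are illustrative and do not reflect any real deployment.
GEO_RULES: Dict[str, str] = {
    "us-east": "edge-us-east.example.net",
    "us-west": "edge-us-west.example.net",
    "eu": "edge-eu.example.net",
}


def resolve(region: Optional[str], latencies_ms: Dict[str, float]) -> Tuple[str, str]:
    """Return (provider_label, hostname) for a lookup.

    If the client's region matches a geo rule, the primary (geo) provider
    answers. Otherwise the lookup falls through to the secondary provider,
    modeled here as picking the hostname with the lowest measured latency.
    """
    if region is not None and region in GEO_RULES:
        return ("primary-geo", GEO_RULES[region])
    # Catchall: no geo rule matched, so choose by lowest latency.
    best_host = min(latencies_ms, key=lambda host: latencies_ms[host])
    return ("secondary-latency", best_host)
```

In this model, a client in a region with no matching rule is still answered, but by the secondary path. Note that the fallthrough only helps when the primary provider responds with "no match"; it does not by itself cover the case where the primary provider is unreachable, which is the situation described for Oct 21.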

Since the Oct 21 attack, PubNub has made multiple changes to our DNS strategy. In the short term, we have already implemented a quick failover strategy that leverages other DNS providers should Dyn or any other DNS provider come under attack again. We are also implementing a plan to allow multiple Geo-DNS providers to function simultaneously, with dynamic distribution across providers, to further mitigate DNS-related risks. As that enhancement is deployed, customers should not expect any interruption or change to their service levels.

Posted Nov 01, 2016 - 04:45 UTC

Resolved

This incident has been resolved.

Posted Oct 22, 2016 - 04:56 UTC

Monitoring

All DNS resolution issues should now be resolved. We will continue to monitor closely before we mark this incident as resolved. Please report any further issues to PubNub Support: http://pubnub.com/support

Posted Oct 21, 2016 - 23:31 UTC

Update

We are still working to resolve the DNS resolution issues and have made much progress. We will continue to update here with further progress.

Posted Oct 21, 2016 - 21:08 UTC

Update

We continue to closely monitor the DNS resolution issues, which continue to have an intermittent effect on internet traffic for many internet services. We are working with our DNS providers towards resolution.

Posted Oct 21, 2016 - 18:34 UTC

Update

All PubNub services are operating normally. Connectivity to the edge was affected in US East and West and is slowly recovering.

Posted Oct 21, 2016 - 17:20 UTC

Update

Mobile Push Notifications may also be delayed due to the ongoing issues.

Posted Oct 21, 2016 - 17:14 UTC

Update

Due to the Dyn DDoS, our support platform, Desk.com, is also experiencing issues. Their status page states, "Delays with sending and receiving emails from Postmark mailboxes."

If you contacted PubNub Support and have not received a prompt response, it is due to this delay. Apologies for the inconvenience.

Posted Oct 21, 2016 - 17:13 UTC

Update

See Dyn status page for further details:
https://www.dynstatus.com/incidents/nlr4yrr162t8

Posted Oct 21, 2016 - 16:35 UTC

Identified

Another Dyn DDoS is occurring. Our ops engineering team is monitoring and taking corrective routing measures where required and possible.

Posted Oct 21, 2016 - 16:16 UTC