US East Data Center Slow Write
Incident Report for PubNub
Postmortem

US East latencies have recovered. Affected services included History Writes, Presence Writes, and Push Writes. History, Presence, and Push read events were not affected. Publish and Subscribe, Stream Controller, BLOCKS, and Access Manager were not affected.

Root cause: new hardware additions in US East. As part of our standard operations practice, we upgraded our US East servers, replacing older instance types that used local ephemeral SSDs with new instance types that use network-mounted SSDs with Provisioned IOPS. After operating successfully for a period, the new routes exhibited variable disk IO performance. That variability gradually built up a backlog in our event pipeline engine; the slowdown did not occur instantly, and the change was neither noticeable nor alertable under our existing metric thresholds. We recovered to normal latencies after adding additional disk IO throughput. Going forward, we will monitor for variable latency on network-mounted SSDs and ensure enough IO capacity is available to absorb the variable throughput they provide.
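As a rough illustration of the "monitor for variable latency" remediation, the sketch below samples average per-write latency for a single block device from Linux /proc/diskstats and flags spikes. The device name, threshold, sampling interval, and alert hook are illustrative assumptions only, not PubNub's actual tooling.

#!/usr/bin/env python3
# Minimal sketch: watch average write latency on a network-mounted SSD
# and flag spikes. All names and thresholds below are assumptions.
import time

DEVICE = "nvme1n1"     # hypothetical network-mounted volume
THRESHOLD_MS = 20.0    # assumed acceptable average write latency
INTERVAL_S = 10        # sampling interval

def read_write_stats(device):
    # Returns (writes completed, ms spent writing) for the device.
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[7]), int(fields[10])
    raise RuntimeError("device %s not found" % device)

prev_writes, prev_ms = read_write_stats(DEVICE)
while True:
    time.sleep(INTERVAL_S)
    writes, ms = read_write_stats(DEVICE)
    d_writes, d_ms = writes - prev_writes, ms - prev_ms
    prev_writes, prev_ms = writes, ms
    if d_writes == 0:
        continue
    avg_latency_ms = d_ms / d_writes
    if avg_latency_ms > THRESHOLD_MS:
        # In practice this would feed a metrics pipeline or page on-call.
        print("write latency %.1f ms exceeds %.1f ms" % (avg_latency_ms, THRESHOLD_MS))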

If you have questions, send us an email at support@pubnub.com.

Posted Sep 15, 2016 - 22:11 UTC

Resolved
US East latencies have recovered. We will post a brief postmortem shortly.
Posted Sep 15, 2016 - 18:32 UTC
Update
Good news: latencies are progressing toward full recovery, though they have not fully recovered yet.
Posted Sep 15, 2016 - 18:25 UTC
Update
We are still working on this incident. Some routes have been slowly recovering, and we are actively working to recover the remaining routes.
Posted Sep 15, 2016 - 18:04 UTC
Update
We've applied the route changes successfully, and latency is slowly starting to recover. Current estimates show latency recovery within the hour. More updates as they become available. We are still working on this issue.
Posted Sep 15, 2016 - 17:19 UTC
Update
We’ve identified several potential causes of slowness and are in the process of updating routes with backups. We will post more updates shortly.
Posted Sep 15, 2016 - 16:44 UTC
Identified
We are seeing high latency in our data pipeline, which services add-on features such as Storage (History), Presence Event Notifications, and Push Notifications.
Posted Sep 15, 2016 - 15:41 UTC