Problem Description: Starting at 06:20 UTC on Dec 25, the PubNub Presence service experienced a sharp increase in 50x errors and/or increased latencies. The incident affected all regions, though at different frequencies and severities. It spanned six discrete timeframes ranging in duration from 20 minutes to 3 hours 40 minutes, fully resolving at 01:06 UTC on Dec 26. The incident management team responded immediately to the alert and worked through multiple phases of issue identification, mitigation, resolution, and root cause analysis. Many of the mitigation attempts showed promising results, restoring the service at various points during the incident (as illustrated in the Impact chart below). While these were temporarily successful, the fix that fully resolved the incident was not identified until the final timeframe, when the team was able to restore the service completely to its normal running state.
Root Cause: The root cause was difficult to identify, as a number of distinct, seemingly unrelated triggering events caused back pressure on the data layer, eventually causing timeouts in a code layer that runs on the data layer (designed to batch and minimize round-trip calls between the various Presence layers and the data layer). Error messages from our data layer provider were not sufficient to identify the specific module causing the timeouts. After escalating and investigating with our data layer provider, we were able to identify and tune the module more effectively, and we added more telemetry to allow us to better alert on, identify, tune, and resolve any similar future incidents.
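The failure mode above can be sketched in Python. Everything here is hypothetical, since the report does not identify the actual data layer, module, or timeout values; the sketch only illustrates how a layer that batches many Presence round trips into one call turns data-layer back pressure into a timeout for every request in the batch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Illustrative deadline only; the production value is not disclosed in the report.
BATCH_TIMEOUT_S = 0.05

def data_layer_call(ops, delay_s):
    """Stand-in for one round trip that executes a whole batch of ops.

    Back pressure on the data layer shows up as added delay here.
    """
    time.sleep(delay_s)
    return [f"ok:{op}" for op in ops]

def run_batch(ops, delay_s):
    """Execute a batch with a deadline; a slow data layer fails all ops at once."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(data_layer_call, ops, delay_s)
        try:
            return future.result(timeout=BATCH_TIMEOUT_S)
        except FutureTimeout:
            # One slow round trip times out every request it batched,
            # amplifying data-layer back pressure into widespread errors.
            return [f"timeout:{op}" for op in ops]

healthy = run_batch(["join", "leave", "heartbeat"], delay_s=0.0)
backpressured = run_batch(["join", "leave", "heartbeat"], delay_s=0.2)
```

The amplification is the key point: because the batching layer exists to minimize round trips, a single delayed round trip degrades every Presence call it was coalescing.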
Impact: Between 06:20 UTC Dec 25 and 01:06 UTC Dec 26, there were six timeframes during which the Presence service was degraded. The incident was experienced as (a) reduced accuracy of Presence counts, (b) an increase in 50x errors, and/or (c) increased latencies in Presence API call responses. The specific timeframes are represented in the graph below:
The duration of the timeframes ranged from ~15 minutes to ~3 hours 30 minutes.
Resolution: There were various attempts at resolution, each providing temporary restoration of the service. These involved switching the service to an alternative “hot” data layer while the primary data layer was restored. As indicated on the chart above, each switch between data layers is evidenced by an immediate restoration of service. Unfortunately, each redundant data layer would eventually begin exhibiting the same issues, requiring constant switching between data layers and restoration of the affected layer. The final resolution involved identifying the module causing the timeouts and tuning that module across all redundant Presence data layers.
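The failover pattern described above can be sketched as follows. The class name, threshold, and layer names are all hypothetical (the report does not describe the real switching mechanism); the sketch shows why each switch restored service immediately, and why the fix only held once the shared module was tuned on every layer.

```python
# Fail over after this many consecutive timeouts (illustrative value only).
ERROR_THRESHOLD = 3

class DataLayerRouter:
    """Hypothetical router over synchronized, redundant Presence data layers."""

    def __init__(self, layers):
        self.layers = layers           # e.g. primary plus "hot" standbys
        self.active = 0                # index of the layer serving traffic
        self.consecutive_timeouts = 0

    def record_result(self, timed_out):
        """Track timeouts on the active layer; fail over when it degrades."""
        if not timed_out:
            self.consecutive_timeouts = 0
            return self.layers[self.active]
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= ERROR_THRESHOLD:
            # Switching restores service immediately, but because every
            # standby runs the same untuned module, each one eventually
            # exhibits the same back pressure and timeouts.
            self.active = (self.active + 1) % len(self.layers)
            self.consecutive_timeouts = 0
        return self.layers[self.active]

router = DataLayerRouter(["primary", "standby-a", "standby-b"])
for _ in range(3):
    current = router.record_result(timed_out=True)
```

This is why the redundancy bought time but could not resolve the incident on its own: the root cause traveled with the module, not with any single data layer.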
Mitigation Steps and Recommended Future Preventative Measures:
The existence of multiple, synchronized data layers provided some redundancy and temporary resolution; however, only tuning the entire system together fully relieved the back pressure. The main reason full restoration took so long was a lack of telemetry from the data layer, which prevented a better understanding of the root cause and the direct location of the back pressure. Three preventative measures were put in place: