Customers experiencing elevated latencies and error rates for our Presence service
Incident Report for PubNub
Postmortem

Problem Description: Starting at 06:20 UTC on Dec 25, the PubNub Presence service experienced a sharp increase in 50x errors and/or increased latencies. The incident affected all regions, though at different frequencies and severities. It spanned six discrete timeframes ranging in duration from 20 minutes to 3:40 hours, completely resolving at 01:06 UTC on Dec 26. The incident management team responded immediately to the alert and worked through multiple phases of issue identification, mitigation, resolution, and root cause analysis. Many of the mitigation attempts showed promising results, restoring the service at various points during the incident (as illustrated in the Impact chart below). While these efforts were temporarily successful, the final resolution was not identified until the last timeframe, when the team was able to restore the service completely to its normal running state.

Root Cause: The root cause was difficult to identify: a number of distinct, seemingly unrelated triggering events caused back pressure on the data layer, eventually causing timeouts in a code layer that runs on the data layer (designed to batch and minimize round-trip calls between the various Presence layers and the data layer). Error messages from our data layer provider were not sufficient to identify the specific module causing the timeouts. After escalating and investigating with our data layer provider, we were able to identify and tune the module more effectively, and we added more telemetry to allow us to better alert on, identify, tune, and resolve any similar incidents in the future.
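To illustrate the role of the batching layer mentioned above, the following is a minimal, hypothetical sketch of a request-batching layer that coalesces many small lookups into fewer round trips; the class, method names, and batch size are illustrative assumptions, not PubNub's actual module.

```python
# Hypothetical sketch of a request-batching layer; not PubNub's actual code.
class BatchedDataLayer:
    """Coalesce many small reads into a few round trips to the data layer."""

    def __init__(self, backend, max_batch=100):
        self.backend = backend      # callable: list of keys -> dict of results
        self.max_batch = max_batch  # illustrative cap on keys per round trip
        self.pending = []

    def enqueue(self, key):
        """Queue a key instead of issuing an immediate round trip."""
        self.pending.append(key)

    def flush(self):
        """Issue batched calls instead of one round trip per key."""
        results = {}
        for i in range(0, len(self.pending), self.max_batch):
            chunk = self.pending[i:i + self.max_batch]
            results.update(self.backend(chunk))
        self.pending = []
        return results
```

A layer like this reduces round trips, but if the backing data layer comes under back pressure, the batched call itself is what times out, which is consistent with the timeouts described above being hard to attribute to a specific module.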

Impact: Between 06:20 UTC Dec 25 and 01:06 UTC Dec 26, there were six timeframes during which the Presence service was degraded. The incident was experienced as (a) reduced accuracy of Presence counts, (b) an increase in 50x errors, and/or (c) increased latencies on Presence API calls. The specific timeframes are represented in the graph below:

The durations of the timeframes ranged from approximately 15 minutes to 3 hours 30 minutes.

Resolution: There were several attempts at resolution, each providing temporary restoration of the service. These involved switching the service to an alternative “hot” data layer while the primary data layer was restored. As indicated on the chart above, the switches between data layers are evidenced by the immediate restoration of service. Unfortunately, each redundant data layer would then begin exhibiting the same issues, requiring constant switching between data layers and restoration of the affected layer. The final resolution involved identifying the module causing the timeouts and tuning that module across all redundant Presence data layers.
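The hot-standby switching described above can be sketched as a simple router that directs Presence traffic away from a degraded data layer; the class, layer names, and timeout threshold below are illustrative assumptions, not PubNub internals.

```python
# Hypothetical sketch of hot-standby data layer switching; illustrative only.
class DataLayerRouter:
    """Route Presence traffic to the first data layer not under back pressure."""

    def __init__(self, layers, timeout_threshold=0.05):
        self.layers = layers                        # synchronized, redundant layers
        self.timeout_threshold = timeout_threshold  # assumed acceptable timeout rate
        self.active = 0                             # index of the current primary

    def record_timeout_rate(self, index, rate):
        """Monitoring feeds in the observed timeout rate for each layer."""
        self.layers[index]["timeout_rate"] = rate

    def current_layer(self):
        """Switch away from a degraded layer so it can be restored offline."""
        for _ in range(len(self.layers)):
            layer = self.layers[self.active]
            if layer.get("timeout_rate", 0.0) <= self.timeout_threshold:
                return layer["name"]
            self.active = (self.active + 1) % len(self.layers)
        # Every layer is degraded: switching no longer helps, which matches
        # why only system-wide tuning fully resolved this incident.
        return self.layers[self.active]["name"]
```

As in the incident, this kind of switching only restores service while at least one layer remains healthy; once every redundant layer exhibits the same back pressure, the fix has to address the underlying module rather than the routing.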

Mitigation Steps and Recommended Future Preventative Measures:

The existence of multiple, synchronized data layers provided some redundancy and temporary resolution; however, only tuning the entire system together completely relieved the back pressure. The main reason for the long time to full restoration was a lack of telemetry from the data layer, which prevented a better understanding of the root cause and the precise location of the back pressure. Three preventative measures were put in place:

  1. Working with our data layer provider, we obtained better telemetry on poorly performing modules.
  2. We added more internal telemetry so that future similar incidents can be detected and alerted on more easily.
  3. We created an updated playbook for how to detect and address any similar issues.
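As a concrete illustration of the detection added in step 2, the following is a minimal sketch of an alerting check over recent Presence API calls; the thresholds, sample format, and function name are assumptions, not PubNub's actual telemetry.

```python
# Illustrative alerting rule; thresholds and sample format are assumptions.
def should_alert(samples, error_rate_threshold=0.01, p95_latency_ms=250):
    """Fire when the 50x error rate or the p95 latency exceeds its threshold.

    `samples` is a list of (status_code, latency_ms) tuples collected from
    recent Presence API calls.
    """
    if not samples:
        return False
    errors = sum(1 for status, _ in samples if 500 <= status < 600)
    if errors / len(samples) > error_rate_threshold:
        return True
    latencies = sorted(ms for _, ms in samples)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return p95 > p95_latency_ms
```

Alerting on both symptoms observed in this incident (50x errors and elevated latencies) rather than either one alone helps catch timeframes where only one symptom appears in a given region.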
Posted Jan 08, 2020 - 03:21 UTC

Resolved
After monitoring for 24 hours, and with all systems stable throughout, this incident has been resolved.
Posted Dec 27, 2019 - 01:30 UTC
Monitoring
Presence is now functioning normally. Due to the several incident events throughout the day, we have decided to maintain a heightened monitoring state. We will fully resolve the incident on the status page once all internal systems have exhibited normal performance for an extended period of time.
Posted Dec 26, 2019 - 01:29 UTC
Update
We are continuing to investigate this issue.
Posted Dec 25, 2019 - 23:05 UTC
Investigating
Presence service errors are returning from all regions; we are currently investigating this issue with high priority.
Posted Dec 25, 2019 - 22:07 UTC
Monitoring
The root cause has been identified and a solution deployed, now we are monitoring the results.
Posted Dec 25, 2019 - 20:54 UTC
Identified
The issue has been identified and we are seeing recovery as our teams continue working to mitigate the problem and monitor the results.
Posted Dec 25, 2019 - 20:40 UTC
Investigating
From 11:21 AM PST, some customers may have experienced an elevated error rate for Presence calls in all regions.
Posted Dec 25, 2019 - 19:47 UTC
This incident affected: Points of Presence (North America Points of Presence, European Points of Presence, Asia Pacific Points of Presence, Latin America Points of Presence) and Realtime Network (Presence Service).