Problem Description: Starting at 06:20 UTC on Dec 25, the PubNub Presence service experienced a sharp increase in 50x errors and/or increased latencies. The incident affected all regions, though at different frequencies and severities. It spanned six discrete timeframes ranging in duration from 20 minutes to 3 hours 40 minutes, fully resolving at 01:06 UTC on Dec 26. The incident management team responded immediately to the alert and worked through multiple phases of issue identification, mitigation, resolution, and root cause analysis. Many of the mitigation attempts showed promising results, restoring the service at various points during the incident (as illustrated in the Impact chart below). While these were temporarily successful, the fix that fully resolved the incident was not identified until the final timeframe, when the team was able to restore the service completely to its normal running state.
Root Cause: The root cause was difficult to identify, as a number of distinct, seemingly unrelated triggering events caused back pressure on the data layer, eventually causing timeouts in a code layer that runs on the data layer (designed to batch and minimize round-trip calls between the various Presence layers and the data layer). Error messages from our data layer provider were not sufficient to identify the specific module causing the timeouts. After escalating and investigating with our data layer provider, we were able to identify and tune the module more effectively, and we added more telemetry to allow us to better alert on, identify, tune, and resolve any similar future incidents.
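The failure mode above can be sketched in Python. Everything here is hypothetical, since the report does not identify the actual data layer, module, or timeout values; the sketch only illustrates how a layer that batches many Presence round trips into one call turns data-layer back pressure into a timeout for every request in the batch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Illustrative deadline only; the production value is not disclosed in the report.
BATCH_TIMEOUT_S = 0.05

def data_layer_call(ops, delay_s):
    """Stand-in for one round trip that executes a whole batch of ops.

    Back pressure on the data layer shows up as added delay here.
    """
    time.sleep(delay_s)
    return [f"ok:{op}" for op in ops]

def run_batch(ops, delay_s):
    """Execute a batch with a deadline; a slow data layer fails all ops at once."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(data_layer_call, ops, delay_s)
        try:
            return future.result(timeout=BATCH_TIMEOUT_S)
        except FutureTimeout:
            # One slow round trip times out every request it batched,
            # amplifying data-layer back pressure into widespread errors.
            return [f"timeout:{op}" for op in ops]

healthy = run_batch(["join", "leave", "heartbeat"], delay_s=0.0)
backpressured = run_batch(["join", "leave", "heartbeat"], delay_s=0.2)
```

The amplification is the key point: because the batching layer exists to minimize round trips, a single delayed round trip degrades every Presence call it was coalescing.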
Impact: Between 06:20 UTC Dec 25 and 01:06 UTC Dec 26, there were six timeframes during which the Presence service was degraded. The incident was experienced as (a) reduced accuracy of Presence counts, (b) an increase in 50x errors, and/or (c) increased latencies in Presence API call responses. The specific timeframes are represented in the graph below:
The duration of the timeframes ranged from ~15 minutes to ~3 hours 30 minutes.
Resolution: There were various attempts at resolution, each providing temporary restoration of the service. These involved switching the service to an alternative “hot” data layer while the primary data layer was restored. As indicated on the chart above, each switch between data layers is evidenced by an immediate restoration of service. Unfortunately, each redundant data layer would eventually begin exhibiting the same issues, requiring constant switching between data layers and restoration of the affected layer. The final resolution involved identifying the module causing the timeouts and tuning that module across all redundant Presence data layers.
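The failover pattern described above can be sketched as follows. The class name, threshold, and layer names are all hypothetical (the report does not describe the real switching mechanism); the sketch shows why each switch restored service immediately, and why the fix only held once the shared module was tuned on every layer.

```python
# Fail over after this many consecutive timeouts (illustrative value only).
ERROR_THRESHOLD = 3

class DataLayerRouter:
    """Hypothetical router over synchronized, redundant Presence data layers."""

    def __init__(self, layers):
        self.layers = layers           # e.g. primary plus "hot" standbys
        self.active = 0                # index of the layer serving traffic
        self.consecutive_timeouts = 0

    def record_result(self, timed_out):
        """Track timeouts on the active layer; fail over when it degrades."""
        if not timed_out:
            self.consecutive_timeouts = 0
            return self.layers[self.active]
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= ERROR_THRESHOLD:
            # Switching restores service immediately, but because every
            # standby runs the same untuned module, each one eventually
            # exhibits the same back pressure and timeouts.
            self.active = (self.active + 1) % len(self.layers)
            self.consecutive_timeouts = 0
        return self.layers[self.active]

router = DataLayerRouter(["primary", "standby-a", "standby-b"])
for _ in range(3):
    current = router.record_result(timed_out=True)
```

This is why the redundancy bought time but could not resolve the incident on its own: the root cause traveled with the module, not with any single data layer.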
Mitigation Steps and Recommended Future Preventative Measures:
The existence of multiple, synchronized data layers provided some redundancy and temporary resolution; however, only tuning the entire system together fully relieved the back pressure. The main reason full restoration took so long was a lack of telemetry from the data layer, which prevented a better understanding of the root cause and the direct location of the back pressure. Three preventative measures were put in place: