Problem Description: Starting at 02:20 PDT on Feb 21, the PubNub Presence service experienced a sharp increase in 50x errors and/or elevated latencies. The incident affected all regions, though with varying frequency and severity. It spanned four discrete time frames ranging in duration from 10 minutes to 30 minutes, fully resolving at 11:00 PDT on Feb 21. The incident management team responded immediately to the alert and worked through multiple phases of issue identification, mitigation, resolution, and root cause analysis. Several mitigation attempts showed promising results, restoring the service at various points during the incident (as illustrated in the Impact chart below).
Root Cause: The root cause was difficult to identify, as several distinct, seemingly unrelated triggering events caused backpressure on the data layer, eventually causing timeouts in a code layer that runs on the data layer. Error messages from our data layer provider were not sufficient to identify the specific module causing the timeouts. After escalations and further investigation, we identified that end users were sending unconstrained JSON objects, which caused the script to fail; the diagnosis was confirmed once the evaluation fix was deployed.
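For illustration only (not PubNub's actual implementation; the function names and limits below are hypothetical), the class of fix amounts to enforcing constraints on user-supplied JSON before it reaches the data layer:

```python
import json

# Hypothetical guardrails; actual production limits are not published here.
MAX_BYTES = 2048    # cap on serialized payload size
MAX_DEPTH = 5       # cap on JSON nesting depth
MAX_KEYS = 32       # cap on keys per object

def _depth(value, level=1):
    """Return the nesting depth of a decoded JSON value,
    rejecting objects with too many keys along the way."""
    if isinstance(value, dict):
        if len(value) > MAX_KEYS:
            raise ValueError(f"object exceeds {MAX_KEYS} keys")
        return max((_depth(v, level + 1) for v in value.values()), default=level)
    if isinstance(value, list):
        return max((_depth(v, level + 1) for v in value), default=level)
    return level

def validate_state_payload(raw: str) -> dict:
    """Reject unconstrained JSON before it reaches the data layer."""
    if len(raw.encode("utf-8")) > MAX_BYTES:
        raise ValueError(f"payload exceeds {MAX_BYTES} bytes")
    payload = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    if _depth(payload) > MAX_DEPTH:
        raise ValueError(f"payload nesting exceeds {MAX_DEPTH} levels")
    return payload
```

Rejecting oversized or deeply nested payloads at the edge keeps a single misbehaving client from inducing backpressure further down the stack.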
Impact: Between 02:20 PDT and 11:00 PDT, there were four timeframes during which the Presence service was degraded. The incident was experienced as (a) inaccurate presence counts, (b) an increase in 50x errors, and/or (c) increased latencies in Presence API responses. The specific timeframes are represented in the graph below:
The duration of the timeframes ranged from ~10 minutes to ~30 minutes.
The main reason full restoration took so long was a lack of telemetry from the data layer, which prevented a clearer understanding of the root cause and the precise location of the backpressure. Preventative measures were put in place:
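As a generic sketch of the kind of telemetry that closes this gap (hypothetical names; not PubNub's actual instrumentation), wrapping each data-layer call in a timing context manager makes the stalled module visible:

```python
import time
import logging
from contextlib import contextmanager

log = logging.getLogger("datalayer")

@contextmanager
def timed_call(module: str, operation: str):
    """Record the latency and outcome of a single data-layer call,
    tagged by module so backpressure can be localized."""
    start = time.monotonic()
    try:
        yield
        status = "ok"
    except Exception:
        status = "error"
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        # In production this would feed a metrics pipeline;
        # a structured log line is the minimal version.
        log.info("datalayer_call module=%s op=%s status=%s latency_ms=%.1f",
                 module, operation, status, elapsed_ms)

# Usage: wrap each data-layer call so timeouts show which module stalls.
# with timed_call("presence-script", "get_state"):
#     client.get_state(channel, uuid)   # hypothetical client call
```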