Southern Asia PoP may have experienced delays with messages using Storage

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

The incident started at about 21:29 UTC (13:29 PST) on 2021-01-05. Due to extremely high CPU usage, messages published in our Mumbai PoP were not being written to Storage. Storage reads still succeeded, with some added latency, for any data persisted before the incident. No data was lost; writes queued up until the writers were able to catch up.

We resolved the issue by restarting the Storage processes. The incident concluded at about 21:58 UTC (13:58 PST).

Mitigation Steps and Recommended Future Preventative Measures

We have updated the code to prevent the errors caused by deleted records in the distributed data storage.

Posted Jan 26, 2021 - 18:57 UTC

Resolved

This incident has been resolved.

Posted Jan 18, 2021 - 15:18 UTC

Monitoring

A fix was implemented at 14:46 UTC, and we will monitor the results for the next 30 minutes.

Posted Jan 18, 2021 - 14:47 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 18, 2021 - 14:32 UTC

Investigating

Starting around 13:42 UTC, customers in our Southern Asia PoP may have experienced delays between the time messages were published and the time they were written to or read from Storage.

Posted Jan 18, 2021 - 14:21 UTC

This incident affected: Realtime Network (Storage and Playback Service) and Points of Presence (Southern Asia Points of Presence).