At 12:00 UTC on May 3, 2022, we observed elevated latency in our Storage & Playback service in the US West PoP, which manifested as missing messages for clients using that service to retrieve messages in that location. Publish and Subscribe were unaffected. We identified the cause as an issue with a downstream data storage service provider in that region and took steps to have other regions assist in processing the message backlog. This caused a secondary effect: temporarily increased latency and error rates in the Storage & Playback and Push services in the US East and AP Northeast PoPs from 13:42 to 13:54 UTC, after which all services were operating nominally. A further secondary effect, increased latency in the Push service, occurred from 14:25 to 14:36 UTC. All systems were then performing within normal bounds, and the incident was considered resolved at 14:36 UTC the same day.
This issue occurred because two of the three database nodes at a database provider in the US West region failed. The provider completed the replacement of the failed nodes at 18:00 UTC the same day, after which we returned to our normal operating posture for the affected services.
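For context on why losing two nodes out of three is significant: if the provider's datastore uses a majority-quorum replication scheme, a common three-node design that we have not confirmed is the one in use here, the single surviving node cannot form a majority on its own, so the cluster stops serving requests even though one node remains healthy. A minimal sketch of that arithmetic:

```python
# Illustrative only: why a 3-node majority-quorum cluster halts when 2 nodes fail.
# This assumes a majority-quorum design; the provider's actual replication
# scheme has not been confirmed.

TOTAL_NODES = 3
QUORUM = TOTAL_NODES // 2 + 1  # majority: 2 of 3 nodes must be reachable

def can_serve(healthy_nodes: int) -> bool:
    """A quorum-based cluster serves requests only while a majority
    of its nodes are healthy."""
    return healthy_nodes >= QUORUM

print(can_serve(3))  # True  - all nodes healthy
print(can_serve(2))  # True  - a single failure is tolerated
print(can_serve(1))  # False - 2 of 3 failed, as in this incident: no quorum
```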
To help minimize the impact of a similar issue in the future, we have updated our operational runbooks for handling a regional database failure based on our observations during this incident. In particular, we noted that the runbook procedure used to route around the issue, bringing other regions' capacity in to assist, caused the secondary effects on the Push service described above, and we have scheduled work to prevent that kind of effect the next time a similar procedure is needed.
We are continuing to work with our database provider to analyze the root cause within their service and to mitigate it going forward.