Elevated History Error/Latency in US West Region

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

At 07:23 UTC on December 9th, we received alerts indicating high error levels related to storage writer operations in one of our data centers. Shortly after, one of our third-party service providers reported a service disruption in their environment and began replacing the affected nodes. Throughout the restoration process, we closely monitored our systems to assess how the issue impacted our environment. Once all nodes were successfully restored, error levels returned to normal, and all associated alerts were resolved.

Mitigation Steps and Recommended Future Preventative Measures

We have worked with our vendor to ensure that nodes of this type will run on redundant infrastructure going forward, reducing our exposure to this kind of incident.

Posted Dec 16, 2024 - 23:17 UTC

Resolved

This incident has been resolved, with no errors observed for the last 30 minutes. We apologize for any impact this may have had on your service. If you wish to discuss the impact on your service, please contact PubNub Support (support@pubnub.com). An RCA will be provided soon.

Posted Dec 09, 2024 - 10:00 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Dec 09, 2024 - 09:28 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Dec 09, 2024 - 08:55 UTC

Investigating

Around 07:23 UTC, we began to notice increasing errors and latency for History in the SJC region (US West).

Posted Dec 09, 2024 - 08:42 UTC

This incident affected: Points of Presence (North America Points of Presence) and Realtime Network (Storage and Playback Service).