Subscribers experiencing errors in all regions

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

At 14:35 UTC on May 18, 2023, we observed some errors being served to subscribers globally. A large, unusual traffic pattern was putting memory pressure on parts of our infrastructure faster than our normal autoscaling could handle. We resolved the issue at 16:15 UTC the same day by manually adding capacity to cover the newly observed pattern. The issue occurred because the system was not prepared to scale quickly enough in response to the combination of factors unique to this traffic.

Mitigation Steps and Recommended Future Preventative Measures

To prevent a similar issue from occurring in the future, we are adding new monitoring and alerting that can detect this scenario, and we are tuning the scaling factors in our systems so that our autoscaling reacts more appropriately to it.
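
As a purely illustrative sketch (not PubNub's actual monitoring), one way to detect memory pressure that grows faster than autoscaling can respond is to compare the projected time until memory is exhausted against the time a scale-up takes to complete. The function names, the sample format, and the `scale_up_lead_s` parameter below are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (timestamp in seconds, memory in use in bytes).
Sample = Tuple[float, float]


def growth_rate(samples: Sequence[Sample]) -> float:
    """Average memory growth in bytes per second across the window."""
    if len(samples) < 2:
        return 0.0
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (m1 - m0) / (t1 - t0)


def seconds_until_exhaustion(
    samples: Sequence[Sample], limit_bytes: float
) -> Optional[float]:
    """Linear projection of when usage reaches the limit.

    Returns None when memory is flat or shrinking.
    """
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    headroom = limit_bytes - samples[-1][1]
    return max(headroom, 0.0) / rate


def should_alert(
    samples: Sequence[Sample], limit_bytes: float, scale_up_lead_s: float
) -> bool:
    """Alert when memory would run out before new capacity could come online.

    scale_up_lead_s is an assumed estimate of how long a scale-up takes.
    """
    eta = seconds_until_exhaustion(samples, limit_bytes)
    return eta is not None and eta < scale_up_lead_s
```

For example, `should_alert([(0, 100), (60, 400)], limit_bytes=1000, scale_up_lead_s=300)` projects exhaustion in 120 seconds, which is sooner than the assumed 300-second scale-up time, so it would signal that manual or faster scaling is needed.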

Posted May 23, 2023 - 23:10 UTC

Resolved

We are resolving this issue and will follow up with a postmortem soon.

We apologize for the impact this may have had on your service. Please reach out to us by contacting PubNub Support (support@pubnub.com) if you wish to discuss the impact on your service.

Posted May 18, 2023 - 18:24 UTC

Monitoring

We have been operating with all systems normal for more than an hour. We are monitoring the situation at this point and investigating the root cause.

Posted May 18, 2023 - 17:38 UTC

Identified

We have identified the issue and are continuing to investigate. All systems are operational.

Posted May 18, 2023 - 17:17 UTC

Investigating

Subscribers were experiencing sporadic errors on May 18 between 14:35 and 16:11 UTC. We are investigating the cause.

Posted May 18, 2023 - 16:48 UTC

This incident affected: Realtime Network (Publish/Subscribe Service) and Points of Presence (North America Points of Presence, European Points of Presence, Asia Pacific Points of Presence, Southern Asia Points of Presence).