Functions are partially down

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

Impact: Functions were partially down for some customers in some regions during the incident window.

Root Cause: A planned change was deployed, after which some jobs failed to be rescheduled. As a result, some customers with production workloads lost server capacity, which caused unnecessary timeout events to be generated.

Resolution: To resolve the incident, the server on each node was drained, one node at a time, and its jobs were rescheduled on a different worker. In addition, eight nodes were added to ensure that enough capacity exists in each impacted region. Once all nodes had drained completely, the incident was resolved.
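
For readers unfamiliar with the drain-and-reschedule pattern described above, the following is a minimal sketch of the general idea, not the tooling used during this incident. The `Node` class, the `drain_node` and `rolling_drain` functions, and the least-loaded placement rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical model of a worker node; not an actual PubNub structure.
    name: str
    capacity: int                      # maximum number of jobs the node can run
    jobs: list = field(default_factory=list)
    schedulable: bool = True           # False while the node is being drained


def drain_node(node, cluster):
    """Move each job off `node` onto the least-loaded schedulable node with room.

    Returns the jobs that could not be placed (they stay on `node`).
    """
    node.schedulable = False
    unplaced = []
    for job in list(node.jobs):
        targets = [
            n for n in cluster
            if n is not node and n.schedulable and len(n.jobs) < n.capacity
        ]
        if not targets:
            unplaced.append(job)
            continue
        target = min(targets, key=lambda n: len(n.jobs))
        target.jobs.append(job)
        node.jobs.remove(job)
    return unplaced


def rolling_drain(cluster, to_drain):
    """Drain nodes strictly one at a time, returning each to service afterwards."""
    for node in to_drain:
        unplaced = drain_node(node, cluster)
        if unplaced:
            # Assumption: stop rather than drop work when capacity runs out.
            raise RuntimeError(
                f"no capacity to reschedule {unplaced} from {node.name}"
            )
        node.schedulable = True        # node is empty and can accept work again
```

In this sketch, adding spare nodes to `cluster` before calling `rolling_drain` plays the role of the extra capacity mentioned above: without enough room elsewhere, a drain stops with an error instead of losing jobs.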

Posted Aug 14, 2019 - 05:08 UTC

Resolved

This incident is resolved and no further function errors should occur. An RCA will be provided early next week.

Posted Aug 08, 2019 - 23:21 UTC

Update

We are continuing to monitor for any further issues.

Posted Aug 08, 2019 - 22:57 UTC

Monitoring

Engineering has completed all deployments and we are transitioning into monitoring mode for the next 30 minutes. If no further errors occur in that time, we will resolve this incident.

Posted Aug 08, 2019 - 22:56 UTC

Update

After a full round of deployments to address the issues, we are not yet in full resolution, although we have seen a major reduction in errors. We expect to reach full resolution soon and move into monitoring status.

Posted Aug 08, 2019 - 22:21 UTC

Update

We continue to make progress with updating the nodes in our PoPs. A bit more time is still needed before full resolution.

Posted Aug 08, 2019 - 21:36 UTC

Update

We are continuing to work towards resolution as we target each of our PoPs, and we are seeing error rates drop. More updates will be posted as we make progress.

Posted Aug 08, 2019 - 20:58 UTC

Identified

We have identified a failure to schedule some event handlers in Functions. We are actively working towards resolution of the pending jobs.

Posted Aug 08, 2019 - 20:26 UTC

Investigating

Some functions for some customers may fail to execute. Engineering is currently working to resolve the issue.

Posted Aug 08, 2019 - 20:23 UTC

This incident affected: Functions (Functions Service).