Functions are experiencing issues with execution

Incident Report for PubNub

Postmortem

We lost one of our service discovery nodes in our Asia Pacific Point of Presence due to an instance failure with one of our providers. During the failure, a bug in the software put the service discovery cluster into a bad state. Our Functions service depends on service discovery being available, so it was affected. To fix it, we needed to restart the server and the clients on a number of our nodes, which took time to do safely. To ensure this doesn’t happen again, we are going to apply a fix for the bug that caused the cluster failure.

Posted Mar 21, 2019 - 19:17 UTC

Resolved

We continue to see success after monitoring for a time. This incident is resolved. A Root Cause Analysis will be posted.

Posted Mar 20, 2019 - 16:41 UTC

Update

We are continuing to monitor for any further issues.

Posted Mar 20, 2019 - 13:18 UTC

Update

We have received confirmation that this resolution has been effective and we will monitor for the next few hours before we set this to resolved status.

Posted Mar 20, 2019 - 13:18 UTC

Monitoring

A solution has been deployed, and we are proactively restarting any functions that were affected by this incident, which will resolve the issue.

Posted Mar 20, 2019 - 13:02 UTC

Update

Any functions that use KV store or XHR calls may experience intermittent failures. We remain "all hands" on this incident and are working to resolve it as soon as possible.

Posted Mar 20, 2019 - 12:44 UTC

Identified

The service discovery mechanism in the region, which is supposed to be redundant, has failed. A solution has been identified, but configuration recovery is taking longer than expected.

Posted Mar 20, 2019 - 11:39 UTC

Investigating

We have no further details to share at this point. We will follow up with more as soon as possible.

Posted Mar 20, 2019 - 10:42 UTC

This incident affected: Functions (Functions Service).