Global timeouts and latency errors for Presence service
Incident Report for PubNub
Postmortem

RCA 05/29/19 Presence Redis Labs outage

10:20-11:02 AM PDT

Impact

The Presence service experienced increased latency and timeouts in all regions.

Root Cause

The Presence service database was recently rolled over to a new Redis DB cluster, after which timeouts were observed. As an immediate remedy, the database was rolled back to the previous version. A step in the roll-back process then failed, which increased the timeouts. The failing step was quickly identified and fixed, which resolved the issue and returned the service to a healthy state.

Mitigation Steps and Recommended Future Preventative Measures

Mitigation Steps

The roll-back process will be reviewed and corrected to prevent failed steps in the future.

Posted 4 months ago. May 31, 2019 - 20:13 UTC

Resolved
We have not seen any further issues since the solution was deployed. If you continue to experience issues related to this incident, please report them to PubNub Support with details (sub-key, error messages, region, SDKs/versions, etc.) and we will continue to troubleshoot.

An official RCA will be posted here as soon as possible.
Posted 5 months ago. May 29, 2019 - 19:00 UTC
Update
We are continuing to monitor for any further issues.
Posted 5 months ago. May 29, 2019 - 18:09 UTC
Monitoring
The root cause has been identified and a solution deployed. All timeout and latency issues should be resolved. The issue was related to a Presence database upgrade: we noticed alerts during the upgrade and immediately rolled back. During the roll-back, a configuration setting was not properly set, which caused the issue to continue. More details will be provided in an official RCA on this status page incident.
Posted 5 months ago. May 29, 2019 - 18:08 UTC
Identified
Engineering has identified the likely root cause. We will post a status update once we have confirmed a solution.
Posted 5 months ago. May 29, 2019 - 17:51 UTC
Investigating
We are currently seeing timeout and latency issues related to the Presence service in all regions.
Posted 5 months ago. May 29, 2019 - 17:40 UTC
This incident affected: Points of Presence (North America Points of Presence, European Points of Presence, Asia Pacific Points of Presence, Latin America Points of Presence) and Realtime Network (Presence Service).