At 12:00 UTC on May 3, 2022, we observed elevated latency in our Storage & Playback service in the US West PoP, which manifested as missing messages for clients using that service to retrieve messages in that location. Publish and Subscribe were unaffected. We identified the cause as an issue with a downstream data storage service provider in that region and took steps to have other regions assist in processing the message backlog. This caused a secondary effect: temporarily increased latency and error rates in the Storage & Playback and Push services in the US East and AP Northeast PoPs from 13:42 to 13:54 UTC, after which all services were operating nominally. A further secondary effect, increased latency in the Push service, occurred from 14:25 to 14:36 UTC. All systems were then performing within normal bounds, and the incident was considered resolved at 14:36 UTC the same day.
This issue occurred because two of the three database nodes at a database provider in the US West region failed. The provider completed the replacement of the failed nodes at 18:00 UTC the same day, after which we returned to our normal operating posture for the affected services.
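For context on why losing two nodes out of three is significant: if the provider's datastore uses a majority-quorum replication scheme, a common three-node design that we have not confirmed is the one in use here, the single surviving node cannot form a majority on its own, so the cluster stops serving requests even though one node remains healthy. A minimal sketch of that arithmetic:

```python
# Illustrative only: why a 3-node majority-quorum cluster halts when 2 nodes fail.
# This assumes a majority-quorum design; the provider's actual replication
# scheme has not been confirmed.

TOTAL_NODES = 3
QUORUM = TOTAL_NODES // 2 + 1  # majority: 2 of 3 nodes must be reachable

def can_serve(healthy_nodes: int) -> bool:
    """A quorum-based cluster serves requests only while a majority
    of its nodes are healthy."""
    return healthy_nodes >= QUORUM

print(can_serve(3))  # True  - all nodes healthy
print(can_serve(2))  # True  - a single failure is tolerated
print(can_serve(1))  # False - 2 of 3 failed, as in this incident: no quorum
```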
To help minimize the impact of a similar issue in the future, we have updated our operational runbooks for handling a regional database failure based on our observations during this incident. In particular, we noted that the runbook procedure used to route around the issue, bringing other regions' capacity in to assist, caused the secondary effects on the Push service described above, and we have scheduled work to prevent that kind of effect the next time a similar procedure is needed.
We are continuing to work with our database provider to analyze the root cause within their service and to mitigate it going forward.