On Tuesday, March 21, 2023, at 09:08 UTC, we observed errors, timeouts, and delays in message delivery in our EU Central PoP. We rolled back the responsible configuration changes, and the issue was resolved at 09:38 UTC. The issue was caused by a configuration change intended to let our subscribe service make better use of existing resources. Instead, the change caused the service to hit a limit on open connection counts, which delayed the creation of new connections. This, in turn, delayed subscribe call connections and message delivery. This was the same issue that occurred on March 16. Unfortunately, the customer impact of the subscribe API is particularly difficult to measure.
To prevent a similar issue from occurring in the future, we will closely monitor a metric that approximates customer impact, including during any further configuration changes.