Starting at around 21:25 UTC on 2021-01-26, messages published in our US-West PoP failed to trigger Functions, and the ability to change Vault key values via Portal was also affected. The incident was triggered when a database used to register Functions reached a size that unexpectedly degraded performance, causing cascading effects in the systems used to trigger Functions.
To mitigate the impact, we routed publishes from the affected PoP to US-East. As of 21:49 UTC all Functions were being triggered again, though Vault was still not functioning properly. At 22:28 UTC Vault began to work again, so all services were restored, though running out of the US-East PoP. After we restarted processes in US-West, all services were restored and the issue was fully resolved at 22:48 UTC.
Mitigation Steps and Recommended Future Preventative Measures
To prevent a similar issue from occurring in the future, we are proactively managing the size of the databases that could be affected by the size threshold this incident uncovered. We have also added items to our backlog to alter the dependencies on the existing data storage approach for registering Functions.