This was a re-emergence of the January 26 Functions incident, whose root cause was misidentified at the time. Starting at around 01:37 UTC on 2021-01-28, some Functions failed to start in our US-West PoP, and published messages failed to trigger those Functions. Messages to Functions that were already running were processed correctly. The incident was triggered when a database used to register Functions reached a size at which its performance unexpectedly degraded, causing cascading effects in the systems used to trigger Functions.
We mitigated the impact by routing traffic around the US-West PoP. By 03:10 UTC all Functions were being triggered, at which point the incident was fully resolved.
Mitigation Steps and Recommended Future Preventative Measures
To prevent a similar issue in the future, we are proactively managing the size of the databases that could be affected by the size threshold uncovered by this incident. We have also added items to our backlog to change how Functions registration depends on the existing data storage approach.