At 14:35 UTC on May 18, 2023, we observed errors being served to subscribers globally. A large, unusual traffic pattern was putting memory pressure on parts of our infrastructure faster than our normal autoscaling could handle. We resolved the issue at 16:15 UTC the same day by manually adding capacity to cover the newly observed pattern. The issue occurred because our systems were not prepared to scale quickly enough for the combination of factors unique to this traffic.
To prevent a similar issue in the future, we are adding monitoring and alerting that can detect this scenario, and we are tuning the scaling factors in our systems so that autoscaling reacts more appropriately to it.