Impact: Functions availability was partially degraded for a subset of customers in some regions during the incident window.
Root Cause: After a planned change was deployed, some jobs failed to be rescheduled. As a result, some customers with production workloads lost server capacity, which generated spurious timeout events.
Resolution: To mitigate the incident, the server on each node was drained one at a time and its jobs were rescheduled onto different workers. In addition, eight nodes were added to ensure sufficient capacity in each impacted region. Once all nodes had drained completely, the incident was resolved.
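For illustration, the drain-and-reschedule procedure could look like the following minimal sketch. The report does not name the orchestration platform, so the APIs used here (cluster.impacted_nodes, node.cordon, node.drain, scheduler.reschedule, cluster.add_nodes) are hypothetical stand-ins, not the actual tooling involved.

```python
# Hypothetical sketch of the remediation described above; all object
# and method names are illustrative assumptions, not a real API.

def drain_and_reschedule(cluster, scheduler):
    """Drain impacted nodes one at a time, moving jobs to healthy workers."""
    for node in cluster.impacted_nodes():
        node.cordon()                  # stop new jobs from landing on this node
        for job in node.running_jobs():
            scheduler.reschedule(job)  # move the job onto a different worker
        node.drain()                   # wait until the node is fully empty

def add_capacity(cluster, nodes_per_region):
    """Add replacement nodes so each impacted region regains headroom."""
    for region, count in nodes_per_region.items():
        cluster.add_nodes(region=region, count=count)
```

Draining one node at a time, rather than all at once, keeps the remaining capacity available to serve workloads while each node's jobs are relocated.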