On October 17, 2025, at 04:51 UTC, some customers may have experienced elevated latency and error rates with the Pub/Sub service in the IAD region (US-East). Our engineering teams began investigating immediately and identified a spike in errors related to a recent update to the Pub/Sub service.
Shortly thereafter, we began formal incident response and initiated a rollback of the service deployment. The issue was fully resolved by 06:50 UTC, and the rollback was completed across all regions by 08:00 UTC.
The issue occurred because a misconfiguration in the release caused incorrect behavior in the channel cleanup logic. Additionally, our alerting configuration did not cover the synthetic test failures that would have surfaced this issue sooner, which delayed detection.
To prevent a similar issue from occurring in the future, our engineering teams have written a simpler and more reliable replacement for the faulty logic. The new code is undergoing rigorous testing before it is reintroduced in a future release.
We are also addressing the gap in alerting that contributed to the delayed response. We have reviewed our synthetic tests and will implement alerting on their failures so that similar regressions are detected earlier. In parallel, we are updating our development and testing processes to catch such issues before code reaches production. Lastly, we are conducting refresher training on our incident response process to ensure faster execution and coordination in the future.