The incident started at about 21:29 UTC (13:29 PST) on 2021-01-05. Due to extremely high CPU utilization, messages published in our Mumbai PoP were not being written to Storage. Storage reads still succeeded, with some added latency, for any data persisted before the incident. No data was lost; instead, it queued up until the writers were able to catch up.
We resolved the incident by restarting the Storage processes. The incident concluded at about 21:58 UTC (13:58 PST).
We have updated the code to prevent the errors caused by deleted records in the distributed data storage.