At 21:50 UTC on 2024-12-11 we observed increased latencies and error rates across all services in our US-East point-of-presence and, a few minutes later, in US-West as well. An investigation identified the PubNub Access Manager (PAM) as the center of the degradation and found that nodes in that service were severely memory constrained. We increased capacity, and the issue was mitigated in both points-of-presence at 22:10 UTC and declared resolved at 22:22 UTC. The issue occurred because a previously unseen pattern of customer behavior overwhelmed a cache in the PAM system, causing memory to become constrained and performance to degrade.
To prevent a similar issue from occurring in the future, we have adjusted the cache's capacity limits and updated our monitoring to alert on this and similar patterns of behavior.
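The remediation above, capping the cache so an unexpected access pattern cannot exhaust node memory, can be sketched as a size-bounded cache with least-recently-used eviction. This is a minimal illustration under stated assumptions: the class name, sizes, and eviction policy are hypothetical and are not a description of PAM's actual implementation.

```python
from collections import OrderedDict


class BoundedLRUCache:
    """A cache with a hard entry limit. When full, it evicts the
    least-recently-used entry instead of growing without bound
    (hypothetical sketch, not PAM's actual design)."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._store: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value) -> None:
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the LRU entry


cache = BoundedLRUCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)      # capacity exceeded: "a" is evicted
print(cache.get("a"))  # None
print(cache.get("c"))  # 3
```

Pairing a bound like this with an alert on eviction rate or cache size gives early warning of the access pattern before memory pressure degrades the node.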