Incident date and time: 6/14/2018, intermittently between 00:21 and 01:36 UTC
Affected Services: PubNub Access Manager (resulting in difficulty reaching downstream services)
Problem Description, Impact and Resolution:
During the incident window, three events caused elevated latencies and errors. Two of the events were of the same type and occurred from 00:21 to 00:36 and from 01:21 to 01:36 UTC in the US-EAST PoP. During these events, publish operations experienced elevated latencies. The cause was an increase in traffic from a narrow IP range that exposed a load distribution bug: these requests were improperly routed, which caused the additional latency. These events were resolved by rate limiting the increased requests until the routing algorithm could be corrected.
The third, independent event occurred from 01:13 to 01:29 UTC in an EU PoP. During this event, customers may have experienced high latency and error rates across all PubNub services. A failure in our service provider's network caused the PubNub Access Manager (PAM) service to have a high failure rate when authenticating client permissions. The incident was resolved by mitigating the failure with capacity and traffic shaping until our service provider restored the network link.
Mitigation Steps and Recommended Future Preventative Measures:
For the first type of incident, we are deploying new code to fix the traffic distribution bug and expect this to eliminate the problem going forward. Additionally, we are continually improving our operations to provide as much flexibility and robustness as possible around traffic shaping and mitigation.
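As an illustration of the kind of rate limiting used to contain traffic from a narrow IP range, the following is a minimal sketch of a token-bucket limiter keyed by source network prefix. It is not PubNub's implementation: the class name, the /24 grouping, and the rate and burst values are all assumptions chosen for the example, and the addresses come from documentation ranges.

```python
import ipaddress
import time


class PrefixRateLimiter:
    """Token-bucket limiter keyed by source network prefix (hypothetical sketch)."""

    def __init__(self, rate_per_sec, burst, prefix_len=24, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens added per second
        self.burst = float(burst)         # maximum tokens a bucket can hold
        self.prefix_len = prefix_len      # assumed grouping; /24 suits IPv4 only
        self.clock = clock                # injectable clock, useful for testing
        self.buckets = {}                 # prefix -> (tokens, last_refill_time)

    def _prefix(self, ip):
        # strict=False masks the host bits, e.g. 203.0.113.7 -> 203.0.113.0/24
        return ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)

    def allow(self, ip):
        """Return True if a request from `ip` may proceed, False if it is limited."""
        key = self._prefix(ip)
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[key] = (tokens - 1.0, now)
            return True
        self.buckets[key] = (tokens, now)
        return False


# Example: all addresses in 203.0.113.0/24 share one budget of 2 requests.
limiter = PrefixRateLimiter(rate_per_sec=1, burst=2)
decisions = [limiter.allow("203.0.113.7"), limiter.allow("203.0.113.99")]
```

Keying the bucket on a prefix rather than on a single address means a burst spread across neighbouring addresses is throttled as one source, while traffic from other networks is unaffected.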
For the EU incident, we are exploring additional failover options to improve availability and mitigate permission lookup failures. This will include an additional failover in the region for all authentication permissions.
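The general idea of a failover for permission lookups can be sketched as trying an ordered list of permission stores and using the first one that answers. This is only an illustration under stated assumptions: the function, the exception class, and the store callables below are hypothetical and do not describe the actual PAM architecture.

```python
class LookupUnavailable(Exception):
    """Raised when no permission store could be reached (hypothetical)."""


def check_permission(stores, auth_key, channel):
    """Try each store in order and return the first definitive answer.

    `stores` is an ordered list of callables (primary first). Each takes
    (auth_key, channel) and returns True or False, or raises ConnectionError
    or TimeoutError when the store cannot be reached.
    """
    errors = []
    for store in stores:
        try:
            return bool(store(auth_key, channel))
        except (ConnectionError, TimeoutError) as exc:
            # This store is unreachable; remember why and try the next one.
            errors.append(exc)
    # Fail closed: never grant access just because every lookup failed.
    raise LookupUnavailable(f"all {len(stores)} permission stores failed: {errors}")


# Example stores (assumptions for illustration only).
def primary_store(auth_key, channel):
    raise ConnectionError("network link down")


def regional_failover_store(auth_key, channel):
    return (auth_key, channel) == ("key-1", "chat")


granted = check_permission([primary_store, regional_failover_store], "key-1", "chat")
```

A denial from a reachable store is returned as-is and does not trigger failover; only connectivity failures move the lookup to the next store.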