Increased global errors and latency for all services

Incident Report for PubNub

Postmortem

Beginning on Friday, June 27, 2025 at 08:15 UTC, three of our services (Pub/Sub, History, and Presence) experienced intermittent increases in latency and errors. The root cause discussed in this analysis was identified and corrected on Monday, June 30.

Problem Description, Impact, and Resolution 

Recently, to give PubNub access to more cloud server capacity across our many regions, we introduced new instance types, making the set of instance types on which PubNub’s services run more heterogeneous. Over time, PubNub has created many OS/kernel-level configurations to optimize the performance of each server. However, with the more heterogeneous set of instance types, an underlying setting that we explicitly specify, which controls limits on network connectivity, was being silently overridden by our upstream load balancers. As a result, the new instance types would reach connectivity limits. Unfortunately, the errors we initially encountered pointed us in the wrong direction, and the investigation took longer than we strive for.

The issue was mitigated once we identified the cause, configured the affected services to run on other instance types, and launched more capacity.

Mitigation Steps and Recommended Future Preventative Measures 

To prevent recurrence, we modified the new instance types to emit metrics related to these OS thresholds and limits, enabling us to detect when these limits are approached or exceeded, regardless of instance type. This change allows us to scale proactively and route traffic appropriately based on instance type, making our deployments more adaptable across heterogeneous instance types.
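As an illustration only, the kind of limit metric described above could be sketched as follows. This report does not name the specific setting involved, so the sketch assumes the Linux connection-tracking table (`nf_conntrack_count` and `nf_conntrack_max` under `/proc/sys/net/netfilter`) as a stand-in. The names `LimitReading`, `read_limit`, and `status`, and the 80% warning threshold, are hypothetical and do not describe PubNub's actual tooling.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LimitReading:
    """Current usage of an OS-level limit (hypothetical helper type)."""
    name: str
    current: int
    limit: int

    @property
    def utilization(self) -> float:
        # Fraction of the limit in use; 0.0 when the limit is zero or unset.
        return self.current / self.limit if self.limit > 0 else 0.0


def read_int(path: Path) -> int:
    """Read a single integer from a procfs-style file."""
    return int(path.read_text().strip())


def read_limit(name: str, current_path: Path, limit_path: Path) -> LimitReading:
    """Build a reading from one file holding usage and one holding the limit."""
    return LimitReading(name, read_int(current_path), read_int(limit_path))


def status(reading: LimitReading, warn_at: float = 0.8) -> str:
    """Classify a reading; warn_at is an assumed threshold, not a known value."""
    if reading.utilization >= 1.0:
        return "exceeded"
    if reading.utilization >= warn_at:
        return "approaching"
    return "ok"


if __name__ == "__main__":
    # Assumption: conntrack is used as an example limit; the actual setting
    # behind this incident is not specified in the report.
    base = Path("/proc/sys/net/netfilter")
    try:
        reading = read_limit(
            "conntrack", base / "nf_conntrack_count", base / "nf_conntrack_max"
        )
    except (OSError, ValueError) as exc:
        print(f"conntrack limit unavailable: {exc}")
    else:
        print(
            f"{reading.name} utilization={reading.utilization:.2%} "
            f"status={status(reading)}"
        )
```

Reading the effective value from the running host, rather than from the intended configuration, is the point of a check like this: a setting overridden at runtime would still show its real limit in the metric.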

Again, we apologize for the incidents outlined above and are committed to maintaining transparency when issues affect our customers. Should you have any questions regarding this analysis, please reach out to our support team at support@pubnub.com.

Posted Jul 03, 2025 - 18:50 UTC

Resolved

On June 29, 2025, around 16:00 UTC, we began seeing elevated latency and increased errors globally across all services. Our engineers scaled up resources, and PubNub services returned to normal levels by 16:28 UTC.
Posted Jun 29, 2025 - 17:15 UTC