MQTT Gateway Issues
Incident Report for PubNub
Postmortem

Incident date: 6/26/2018

Affected Services: MQTT Gateway

Problem Description, Impact and Resolution:

The MQTT gateway experienced a hardware failure in the data configuration caching layer. Our automatic failover detection promoted new hardware to an operational state. Unfortunately, the MQTT application did not fail over to the new hardware properly and consistently, due to a bug in the service discovery mechanism. We deployed patched service discovery code, which resolved the incident.

Depending on the account key and how and where its data configuration was stored, some account keys experienced intermittent connectivity to the MQTT gateway, a consistent inability to maintain established connections (requiring clients to reconnect), and/or an inability to connect consistently.
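Because affected connections had to be re-established, a client-side retry loop is one way to ride out this kind of intermittent connectivity. The following is a minimal sketch, not PubNub's client code: `reconnect_with_backoff` and its `connect` argument are hypothetical names, and `connect` is assumed to raise `ConnectionError` on failure.

```python
import random
import time


def reconnect_with_backoff(connect, max_attempts=6, base_delay=1.0,
                           max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure.

    connect is a hypothetical zero-argument callable that returns a
    connection object on success and raises ConnectionError on failure.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # Exponential backoff capped at max_delay, with random jitter
            # so that many clients do not reconnect at the same instant.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

The jitter matters during a gateway incident: if every disconnected client retries on the same schedule, the reconnect bursts can add load to a service that is already recovering.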

Mitigation Steps and Recommended Future Preventative Measures:

We have added tests to prevent this service discovery failure in the future. We strive to provide the most reliable and available service possible, and we take this incident very seriously: our monitoring of the MQTT Gateway service failed to notify us properly of the impacted services. We are actively overhauling the monitoring of the MQTT Gateway service to achieve our expected reliability.
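As an illustration of the kind of test described above, the sketch below checks that a discovery client stops using a failed endpoint once new hardware has been promoted. It is not the actual MQTT Gateway code or test suite: `Registry`, `DiscoveryClient`, and the endpoint names `cache-a` and `cache-b` are all hypothetical.

```python
class Registry:
    """Hypothetical in-memory stand-in for a service registry."""

    def __init__(self, primary):
        self.primary = primary

    def promote(self, new_primary):
        # Simulates failover detection promoting new hardware.
        self.primary = new_primary


class DiscoveryClient:
    """Hypothetical client that caches the endpoint it resolved."""

    def __init__(self, registry):
        self._registry = registry
        self._endpoint = None

    def endpoint(self):
        # Resolve lazily and cache the result until a failure is reported.
        if self._endpoint is None:
            self._endpoint = self._registry.primary
        return self._endpoint

    def report_failure(self):
        # Drop the cached endpoint so the next call re-resolves.
        self._endpoint = None


def test_client_follows_failover():
    registry = Registry("cache-a")
    client = DiscoveryClient(registry)
    assert client.endpoint() == "cache-a"

    registry.promote("cache-b")   # new hardware becomes primary
    client.report_failure()       # client observes the old endpoint failing
    assert client.endpoint() == "cache-b"
```

A test of this shape fails if the client keeps returning its cached endpoint after a reported failure, which is the class of behavior a service discovery regression test would aim to catch.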

Posted Jun 28, 2018 - 15:00 UTC

Resolved
We are resolving this incident.
Posted Jun 26, 2018 - 19:38 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 26, 2018 - 18:28 UTC
Update
We have identified the issue as a malfunctioning data cache configuration and implemented a fix. We are actively monitoring the service.
Posted Jun 26, 2018 - 17:48 UTC
Investigating
We are currently investigating a problem with our MQTT gateway. We will update the status page as we get more information.
Posted Jun 25, 2018 - 23:45 UTC
This incident affected: Realtime Network (MQTT Gateway).