Incident date: 6/26/2018
Affected Services: MQTT Gateway
Problem Description, Impact and Resolution:
The MQTT gateway experienced a hardware failure in the data configuration caching layer. Our automatic failover detection promoted new hardware to an operational state. However, the MQTT application did not fail over properly and consistently to the new, reliable hardware because of a bug in the service discovery mechanism. We deployed patched service discovery code, which resolved the incident.
As a result, some account keys experienced one or more of the following, depending on the account key and how and where its data configuration was stored: intermittent connectivity to the MQTT gateway, established connections that could not be maintained (requiring clients to reconnect), and an inability to connect consistently.
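For clients affected by dropped connections like those described above, reconnecting with exponential backoff is a common way to recover without flooding a gateway that is still failing over. The following is a minimal sketch only, not part of our client libraries: the names `backoff_delays` and `connect_with_retry`, the `connect` callable, and all default values are hypothetical and chosen for illustration.

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, attempts=6, jitter=0.0, rng=None):
    """Yield `attempts` delays that double each time, capped at `cap` seconds.

    `jitter` is a fraction of each delay added at random so that many
    clients do not all reconnect at the same instant.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay + rng.uniform(0, jitter * delay)


def connect_with_retry(connect, delays, sleep=time.sleep):
    """Call connect() until it succeeds or the delays run out.

    `connect` is a hypothetical zero-argument callable supplied by the
    caller; it should raise ConnectionError on failure.
    """
    last_error = None
    for delay in delays:
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait before the next attempt
    raise ConnectionError("gave up reconnecting") from last_error
```

A caller would pass its own connection function, for example `connect_with_retry(open_session, backoff_delays(jitter=0.2))`, where `open_session` stands in for whatever the MQTT client library in use provides.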
Mitigation Steps and Recommended Future Preventative Measures:
We have added tests to prevent this service discovery failure from recurring. We strive to provide the most reliable and available service possible, and we take this incident very seriously, particularly because our monitoring of the MQTT Gateway service failed to notify us properly of the impacted services. We are actively overhauling the monitoring of the MQTT Gateway service to achieve our expected reliability.
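To illustrate the kind of regression test described above, the sketch below checks that a client re-resolves its cache endpoint after a failover instead of staying pinned to the failed node. It does not show our actual service discovery code or test suite: `Registry`, `CacheClient`, and the node names `cache-a` and `cache-b` are hypothetical stand-ins.

```python
class Registry:
    """Hypothetical in-memory stand-in for a service discovery registry."""

    def __init__(self, nodes):
        self.healthy = list(nodes)

    def mark_failed(self, node):
        self.healthy.remove(node)

    def promote(self, node):
        self.healthy.append(node)

    def lookup(self):
        if not self.healthy:
            raise LookupError("no healthy cache node")
        return self.healthy[0]


class CacheClient:
    """Hypothetical client that resolves its endpoint through the registry."""

    def __init__(self, registry):
        self.registry = registry
        self.endpoint = registry.lookup()

    def fetch(self, key, read):
        try:
            return read(self.endpoint, key)
        except ConnectionError:
            # Re-resolve once instead of staying pinned to the failed node.
            self.endpoint = self.registry.lookup()
            return read(self.endpoint, key)


def test_client_follows_failover():
    registry = Registry(["cache-a"])
    client = CacheClient(registry)

    # Simulate the hardware failure and the promotion of new hardware.
    registry.mark_failed("cache-a")
    registry.promote("cache-b")

    def read(endpoint, key):
        if endpoint == "cache-a":
            raise ConnectionError("cache-a is down")
        return (endpoint, key)

    assert client.fetch("config", read) == ("cache-b", "config")
    assert client.endpoint == "cache-b"
```

The test fails if the client keeps using the endpoint it resolved at startup, which is the general class of failure this incident exposed.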