Preliminary Root Cause Analysis
Today we are releasing a preliminary Root Cause Analysis reflecting the current state of our investigation into the network incident. While we have high confidence in the technical details of the network incident itself, we will extend this analysis as new information becomes available about the services affected by its spillover effects. We expect to publish the complete RCA by the end of this week.
What happened?
During a scheduled maintenance window to replace a faulty switch in a cluster situated in the TXL data center, a network outage occurred.
The incident triggered a spillover effect: a large queue of unprocessed provisioning jobs built up for the affected cluster. Following network recovery, this backlog caused significant resource-locking delays, slowing the processing of the queued jobs. This impacted systems and services that relied on self-healing automations triggered by the primary incident, or that were undergoing changes at the time.
How was this possible? (Technical Root Cause)
The maintenance took place in Topology 2 of 2 of the affected cluster, and Topology 2 was disabled as planned. At this point, the Multicast Forwarding Table (MFT) capacity across all switches in Topology 1 was already strained by saturation from IPoIB multicast groups, leaving Topology 1 operating in a vulnerable state.
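To make that pre-incident state concrete, here is a minimal sketch of an MFT headroom check. The capacity limit, group count, and function name are hypothetical illustrations and do not reflect the actual switch firmware or fabric values:

```python
# Minimal sketch (hypothetical numbers and names): why taking one of two
# topologies offline is risky when the surviving topology's Multicast
# Forwarding Table (MFT) is already near capacity.

MFT_CAPACITY = 4096          # hypothetical MFT entry limit per switch

def mft_headroom(programmed_groups: int, capacity: int = MFT_CAPACITY) -> int:
    """Entries still free before multicast route programming starts to fail."""
    return capacity - programmed_groups

# IPoIB multicast groups already programmed on Topology 1 switches
# (hypothetical figure illustrating saturation):
groups_on_topology_1 = 4090

print(mft_headroom(groups_on_topology_1))  # 6 -> effectively no room for the
                                           # reprogramming a failover requires
```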
Topology 1 of 2 then suffered a critical gateway overload, initiated by a link flap on a switch that flooded the InfiniBand control plane with alerts and triggered a fabric resweep, an automated self-healing mechanism that rebuilds the network map. Because the MFT was already at capacity, the route reprogramming required for this reset failed. The resulting surge of unsuccessful route updates overwhelmed the management layer, causing BGP sessions to drop and ultimately severing connectivity for hosts in the affected cluster.
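The failure cascade can be illustrated with a small simulation. Everything below (the class and function names, the per-switch capacity, the route counts) is a hypothetical sketch of the mechanism described above, not the actual fabric management code:

```python
MFT_CAPACITY = 4096          # hypothetical per-switch MFT entry limit

class Switch:
    def __init__(self, programmed_groups: int):
        self.programmed_groups = programmed_groups

    def program_route(self) -> bool:
        """Install one multicast route; fails once the MFT is full."""
        if self.programmed_groups >= MFT_CAPACITY:
            return False
        self.programmed_groups += 1
        return True

def fabric_resweep(switches: list, routes: int) -> int:
    """Self-healing resweep: rebuild the network map by reprogramming routes.

    Returns the number of failed route updates; in the incident, a surge of
    such failures is what overwhelmed the management layer.
    """
    failures = 0
    for _ in range(routes):
        for sw in switches:
            if not sw.program_route():
                failures += 1
    return failures

# Topology 1 switches were already saturated when the link flap triggered
# the resweep:
switches = [Switch(programmed_groups=MFT_CAPACITY) for _ in range(8)]
print(fabric_resweep(switches, routes=500))  # 4000 failed updates: the surge
                                             # that dropped the BGP sessions
```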
As the connectivity loss made it impossible for provisioning jobs to succeed in the affected cluster, all changes scheduled for resources on the affected hosts were queued.
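As a rough sketch, a provisioning scheduler behaving this way might queue changes for unreachable hosts as follows; the scheduler, connectivity check, and job names are all hypothetical:

```python
# Minimal sketch (hypothetical scheduler) of how provisioning jobs for
# unreachable hosts are queued rather than executed, building the backlog.

from collections import deque

backlog: deque = deque()

def host_reachable(host: str) -> bool:
    """Placeholder connectivity check; always False during the outage."""
    return False

def apply_change(host: str, change: str) -> None:
    print(f"applying {change} on {host}")

def schedule_change(host: str, change: str) -> None:
    if host_reachable(host):
        apply_change(host, change)        # normal path
    else:
        backlog.append((host, change))    # outage path: queue for later

for i in range(3):
    schedule_change(f"host-{i}", "resize-volume")

print(len(backlog))  # 3: every scheduled change waits for connectivity
```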
The network incident was resolved by a switchover from the affected Control Plane to its standby, a reboot of the locked-up gateways, and recovery of Topology 1 from the scheduled maintenance.
Although the network incident was resolved, the resulting volume of provisioning jobs triggered significant resource locking. This bottleneck delayed critical updates, leaving several (self-healing) services in a degraded state while they waited for their queued jobs to be processed.
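The effect of resource locking on a large backlog can be sketched as follows, assuming a hypothetical per-resource lock and hypothetical job timings: when many queued jobs contend for the same lock, they drain strictly one at a time.

```python
import threading
import time

# One lock per contended resource; a single hot resource is enough to
# serialize the whole backlog (names and timings are hypothetical).
resource_locks = {"cluster-network-config": threading.Lock()}

def process_job(job_id: int, resource: str) -> None:
    # Every queued job touching this resource waits its turn on the lock,
    # including the updates that self-healing services depend on.
    with resource_locks[resource]:
        time.sleep(0.01)  # hypothetical per-job work while holding the lock

threads = [
    threading.Thread(target=process_job, args=(i, "cluster-network-config"))
    for i in range(200)  # the post-recovery backlog arriving all at once
]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"backlog drained in {time.time() - start:.1f}s")  # ~2s, fully serialized
```

Jobs arriving gradually would amortize this serialization; a recovery-sized backlog arriving all at once does not, which is why the affected services remained degraded until the queue drained.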
What are we doing to prevent recurrence?
While we are still compiling the complete list of action items for the affected services, we can already share the following items:
We remain committed to transparency, and we thank you for your patience as we continue our investigation with due diligence.