Update 18.11.2025 - Following further investigation, we are updating our preliminary Root Cause Analysis with additional findings.
Update 24.11.2025 - We are providing more concrete timelines for the action items.
What happened:
On 14.11.2025, between 13:54 and 14:47 UTC, we observed a network connectivity issue in our TXL datacenter, affecting multiple services and leading to increased latency and packet loss.
This happened because:
A datacenter cluster experienced network instability due to broken failover behavior in the control plane of our InfiniBand (IB) fabrics. The core issue was constant switching of the Master/Standby roles between the redundant InfiniBand Subnet Managers, leading to inconsistencies in the fabric.
The root cause of this instability was identified as overly aggressive timeout and retry settings in the Subnet Manager's High Availability (HA) configuration. These overly sensitive settings allowed a minor, otherwise insignificant latency spike to trigger an initial failover.
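For illustration, the snippet below shows the kind of settings involved in an OpenSM-based deployment. Note that this report does not state which Subnet Manager implementation or which values were in use; the option names are OpenSM's standard polling settings, and the values shown are OpenSM's defaults, given only as a reference point rather than as our fabric's configuration.

```
# Illustrative excerpt of OpenSM HA polling settings (opensm.conf).
# Values shown are OpenSM defaults, not the settings that were
# active during the incident.

# Interval in milliseconds between liveness polls of the active
# master SM; too small an interval makes a brief latency spike
# look like a dead master.
sminfo_polling_timeout 10000

# Number of consecutive failed polls before the standby declares
# the master dead and starts a takeover.
polling_retry_number 4

# Relative priority (0-15) used during the Subnet Manager election.
sm_priority 15
```

The smaller the polling timeout and retry count, the shorter the latency spike needed to trigger a takeover.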
This created a self-perpetuating cycle: the resource overhead of the initial failover caused a spike in CPU load on the IB gateways, the increased load introduced further latency, and the aggressive HA settings interpreted that latency as another failure, immediately triggering the next failover. The loop continued until the Subnet Manager election process stabilized, at which point the system recovered automatically.
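To make the dynamics of this loop concrete, here is a minimal simulation sketch. All numbers (base latency, failover cost, timeout thresholds) are hypothetical and chosen purely for illustration; this is not a model of our actual fabric.

```python
# Minimal sketch (hypothetical numbers) of the failover feedback loop:
# each failover adds CPU load on the IB gateways, the extra load adds
# latency, and latency above the HA timeout triggers the next failover.

def simulate(timeout_ms: float, steps: int = 10) -> None:
    base_latency_ms = 5.0      # normal control-plane latency
    spike_ms = 30.0            # the initial, otherwise harmless spike
    failover_cost_ms = 40.0    # latency added by one failover's CPU load
    decay = 0.5                # fraction of extra load that drains per step

    extra_ms = spike_ms
    failovers = 0
    for step in range(steps):
        latency = base_latency_ms + extra_ms
        if latency > timeout_ms:
            failovers += 1     # HA layer declares the master dead
            extra_ms += failover_cost_ms
        extra_ms *= decay      # load drains between polls
        print(f"step {step}: latency={latency:.1f} ms, failovers={failovers}")

# Aggressive timeout: the spike triggers a failover whose own load
# keeps latency above the threshold, so the flapping sustains itself.
simulate(timeout_ms=25.0)

# Relaxed timeout: the same spike is absorbed and no failover occurs.
simulate(timeout_ms=60.0)
```

With the aggressive threshold, the failover's own overhead keeps latency above the timeout on every step; with the relaxed threshold, the identical spike simply decays without triggering a single failover.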
What we are doing to prevent this from happening again:
We understand that this incident had a negative impact on our customers and partners. We believe the following steps will make our setup more resilient and help prevent a similar incident in the future.