This is our Root Cause Analysis. We will update the RCA as we complete our investigation, inventory, and planning of repair items.
Update 19.11.2025 - Expanding on RTRS connection loss as contributing factor
Update 11.12.2025 - Adding repair items related to network upgrades and their ETAs
What Happened?
A defective Host Channel Adapter (HCA) caused instability in the InfiniBand fabric starting at 2025-11-17 22:08:37 UTC. Although this initial disruption was relatively brief, the resulting instability (a burst of frequent, simultaneous reconnect attempts within a short period of time) severed storage connections and triggered a bug in a software package used on our storage servers. The exception raised by this bug caused several storage servers to reboot at the same time. Together, the severed connections and the simultaneous reboots led to a loss of storage connectivity for VMs in the affected cluster.
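The trigger here is essentially a thundering-herd pattern: many clients that lose their connection at the same moment also retry at the same moment. The sketch below is illustrative only and assumes nothing about the actual RTRS client internals; it contrasts fixed-interval retries, which cluster into bursts, with exponential backoff plus jitter, which spreads the same retries out over time.

```python
import random

# Illustrative only: this models generic reconnect timing, not the actual
# RTRS client. All names and parameters here are hypothetical.

CLIENTS = 50        # clients that lost their connection at the same time
ATTEMPTS = 5        # reconnect attempts per client
BASE_DELAY = 1.0    # seconds before the first retry

def fixed_interval_retries():
    """Every client retries on the same fixed schedule -> attempts cluster."""
    times = []
    for _ in range(CLIENTS):
        for attempt in range(ATTEMPTS):
            times.append(BASE_DELAY * (attempt + 1))
    return times

def jittered_backoff_retries():
    """Exponential backoff with full jitter spreads the same attempts out."""
    times = []
    for _ in range(CLIENTS):
        t = 0.0
        for attempt in range(ATTEMPTS):
            t += random.uniform(0, BASE_DELAY * (2 ** attempt))
            times.append(t)
    return times

def peak_attempts(times, window=0.5):
    """Largest number of reconnect attempts inside any single time window."""
    times = sorted(times)
    peak = 0
    for i, start in enumerate(times):
        j = i
        while j < len(times) and times[j] < start + window:
            j += 1
        peak = max(peak, j - i)
    return peak

if __name__ == "__main__":
    print("peak per 0.5 s, fixed interval:  ", peak_attempts(fixed_interval_retries()))
    print("peak per 0.5 s, jittered backoff:", peak_attempts(jittered_backoff_retries()))
```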
How was that possible?
While the initial network disruption was relatively brief, it severed RTRS (RDMA Transport, where RDMA stands for Remote Direct Memory Access) connections to some storage servers and triggered a bug in a software package used on our storage servers. This caused several storage servers to reboot unexpectedly, which reduced redundancy and increased the likelihood of volumes becoming unavailable to VMs. The HCA hardware fault, combined with the severed RTRS sessions and the unexpected reboot of several storage servers, led to a loss of redundancy and connectivity and has been identified as the technical root cause of this incident. While the infrastructure itself could be recovered relatively quickly, recovering the affected VMs and services took longer than expected. We are still investigating the exact reason for the extended recovery duration.
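To make the redundancy point concrete: a replicated volume stays reachable as long as at least one server holding a replica of it is up, so a single reboot rarely affects availability, while several simultaneous reboots can take out every replica of some volumes at once. The sketch below is a toy model; the server count and the 3-way replication factor are assumptions chosen for illustration, not a description of our actual storage layout.

```python
from itertools import combinations

# Toy model, for illustration only: the server count and the 3-way replication
# factor are assumptions, not a description of the actual storage layout.

SERVERS = list(range(12))   # hypothetical storage servers in one cluster
REPLICAS = 3                # assumed replicas per volume

def unavailable_fraction(rebooted_servers):
    """Fraction of possible replica placements with no surviving replica."""
    down = set(rebooted_servers)
    placements = list(combinations(SERVERS, REPLICAS))
    lost = sum(1 for placement in placements if set(placement) <= down)
    return lost / len(placements)

if __name__ == "__main__":
    # A single reboot never takes out all replicas of a volume.
    print(unavailable_fraction([0]))                         # 0.0
    # Several simultaneous reboots do, for some volumes, until the
    # servers come back and redundancy is re-established.
    print(round(unavailable_fraction([0, 1, 2, 3, 4]), 3))   # ~0.045
```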
How did we respond?
The InfiniBand ports of the problematic server were shut down, successfully isolating the defective HCA. The fabric topologies reconverged quickly. The resulting storage connectivity issues then had to be addressed by re-attaching the storage, re-establishing storage redundancy, and resetting or reprovisioning the affected VMs.
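For context on the isolation step: disabling an InfiniBand port can be done with standard fabric tooling. The helper below is a hypothetical sketch that assumes the ibportstate tool from the infiniband-diags package is available and that the target port is addressed by its LID; it is not the exact procedure we used during the incident.

```python
import subprocess

# Hypothetical helper, for illustration only. It assumes the ibportstate tool
# from the infiniband-diags package is installed and that the target port is
# addressed by its LID. This is not the exact procedure used in the incident.

def disable_ib_port(lid: int, port: int) -> None:
    """Disable one InfiniBand port, e.g. a switch port facing a faulty HCA."""
    subprocess.run(["ibportstate", str(lid), str(port), "disable"], check=True)

def query_ib_port(lid: int, port: int) -> str:
    """Return the current port state so the change can be verified."""
    result = subprocess.run(
        ["ibportstate", str(lid), str(port)],  # no operation defaults to query
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```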
What are we doing to prevent this from happening again?
While the investigation is still ongoing, we have already inventoried the following actions. We will assign ETAs to them as soon as the investigation and planning are finished.
Hardware & Fabric Resilience
Storage Stability
Improved Recovery
As we continue our investigation, we will update and add to this Root Cause Analysis. We would like to thank you for your patience and teamwork over the last few hours.