Root Cause Analysis
Between 11 and 12 May 2026, a cluster in our FKB datacenter was affected by two consecutive storage hardware failures. These led to service degradation and an I/O outage spanning over 15 hours for affected workloads, and limited provisioning functionality in affected VDCs.
In the following Root Cause Analysis, we explain the incident, identify the technical triggers of the outage, describe the work done during mitigation, and outline the measures we are taking to prevent a recurrence.
Technical Context
Because the failure mode was a cascading sequence of two hardware events on a redundant storage pair, we want to provide the technical context necessary to understand how a redundant system could be affected to this degree:
To provide a resilient service, storage servers in the affected clusters are deployed in redundant pairs (Leg A and Leg B) that hold synchronized data. Each leg itself has built-in redundancy and fault tolerance - for example, via RAID configurations for disks.
The two legs are designed to mitigate risk from hardware failures or issues that would render the resilience mechanisms of a single leg ineffective. Failure of one leg is largely transparent to the VM host, which - for customers - continues to operate normally.
In a "Zero Leg" event - where connectivity to both legs is lost - VMs served by the affected storage experience an outage. This is the event that caused this incident.
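The relationship between the two legs and the resulting service state can be sketched as a small model. This is an illustration only: the names PairState, pair_state and vms_have_io are hypothetical and do not refer to any component of the actual storage platform.

```python
from enum import Enum


class PairState(Enum):
    """Redundancy states of a storage pair (hypothetical naming)."""
    TWO_LEG = "two-leg: fully redundant"
    SINGLE_LEG = "single-leg: no failover path left"
    ZERO_LEG = "zero-leg: I/O outage"


def pair_state(leg_a_reachable: bool, leg_b_reachable: bool) -> PairState:
    """Classify a redundant pair by the number of reachable legs."""
    reachable = int(leg_a_reachable) + int(leg_b_reachable)
    if reachable == 2:
        return PairState.TWO_LEG
    if reachable == 1:
        return PairState.SINGLE_LEG
    return PairState.ZERO_LEG


def vms_have_io(state: PairState) -> bool:
    """VMs keep their storage I/O as long as at least one leg is reachable."""
    return state is not PairState.ZERO_LEG
```

In this model, losing one leg changes the state but not the outcome for VMs. Only the transition to the zero-leg state interrupts I/O.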
What Happened
On 11 May 2026, the first of two servers (Leg A) in a redundant storage backend pair began reporting multiple disk errors in its RAID array. The error pattern - UDMA CRC errors across several disks - was consistent with degradation of a shared physical interconnect (SAS backplane or cabling), rather than wear on any single disk. From this point onward, all dependent storage volumes were operating without their secondary failover path, served exclusively from Leg B - a single-leg event.
When a forced array reassembly was initiated in the afternoon, additional disks faulted during the rebuild, degrading the array below the recoverable threshold. A subsequent reboot brought the array up fully inactive, and Leg A was considered unrecoverable. The decision was made to begin migrating redundancy onto alternative storage systems, with the goal of restoring two-leg operation. Due to the volume of data that needed to be synchronized, this process was expected to take several hours.
On 12 May at 05:13 UTC - while synchronization was still in progress - Leg B experienced an independent disk disconnection in its own RAID array. With Leg A already non-functional, this second failure eliminated all remaining redundancy in the pair. All dependent volumes dropped to zero available copies simultaneously, causing an I/O outage for the affected virtual machines.
Leg B was rebooted at 05:38 UTC and its array force-assembled with the remaining healthy disks, which allowed a first wave of volume recovery to complete by 08:25 UTC. Additional disk failures on the same server then materialized at 10:59 UTC, triggering a second outage. At this point, the decision was taken to perform a complete server replacement: at approximately 14:30 UTC, all disks were physically transferred to new server hardware and the RAID arrays were rebuilt on the replacement system.
The hardware swap introduced an additional recovery complication: the replacement server now carried a different storage network identity. Compute nodes across the cluster still held cached connection sessions pointing to the original hardware identity, and VM restart attempts subsequently failed. When a cleanup of stale connection mappings was performed across the affected compute nodes, existing automation tooling proved ineffective, which significantly hampered recovery efforts. Leg B was confirmed fully operational at 20:45 UTC, and affected VMs began recovering. VMs where automatic recovery was not possible were remediated manually in the following hours.
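For reference, the elapsed times between the timestamps stated above can be worked out as follows. This sketch uses only the UTC times given in this report for 12 May. Treating 05:13 to 20:45 as the overall window is an assumption made for illustration, and the variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc(hhmm: str) -> datetime:
    """Build a UTC timestamp on 12 May 2026 from an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 5, 12, hour, minute, tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Format a duration as hours and minutes."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


zero_leg_start = utc("05:13")        # Leg B disk disconnection
first_wave_recovered = utc("08:25")  # first wave of volume recovery complete
second_outage_start = utc("10:59")   # additional disk failures on Leg B
leg_b_operational = utc("20:45")     # Leg B confirmed fully operational

# Assumed overall window: first zero-leg event until Leg B was operational.
overall = fmt(leg_b_operational - zero_leg_start)           # 15h 32m
first_phase = fmt(first_wave_recovered - zero_leg_start)    # 3h 12m
second_phase = fmt(leg_b_operational - second_outage_start) # 9h 46m
```

Measured this way, the window from the first zero-leg event to the confirmation of Leg B is 15 hours and 32 minutes, in line with the "over 15 hours" stated in this report.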
During the subsequent synchronization process to re-establish redundancy between the new legs, the provisioning service was deactivated in VDCs that contained VMs served by the affected storage pair. This meant customers were temporarily unable to roll out changes to affected VMs. Storage redundancy was fully restored on 15 May, once all affected volumes had been migrated back to a fully redundant configuration. At this point, the service was fully recovered.
How was that possible? (Root Cause)
The incident was caused by two independent hardware failures on the two servers of a single redundant storage pair, occurring several hours apart. Each failure on its own would not have caused customer impact. The combination of both, within the window between the first failure and the completion of redundancy migration, eliminated the redundancy that the architecture is designed to provide.
Primary failure - Leg A. The errors observed on Leg A were consistent with degradation of a shared physical interconnect - a SAS backplane or cabling issue - rather than end-of-life wear on individual disks. This pattern explains both why multiple disks failed in a correlated way, and why repair attempts could not outpace degradation: the rebuild operations themselves exercise the interconnect, and a second disk faulted under rebuild load before the array could complete. A stale array member prevented full reassembly, and the server was effectively out of service after its reboot.
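The reasoning that correlated errors across several disks point to a shared component rather than to a single disk can be sketched as a simple heuristic. This is an illustrative assumption, not the diagnostic procedure used during the incident. The function name, the disk and backplane labels, and the threshold are all hypothetical.

```python
from collections import defaultdict


def suspect_interconnects(crc_deltas, disk_to_backplane, min_disks=2):
    """Return shared components where several disks show new CRC errors.

    crc_deltas: increase in CRC error count per disk since the last check.
    disk_to_backplane: which shared component each disk is attached to.
    min_disks: hypothetical threshold for calling the pattern correlated.
    """
    affected = defaultdict(set)
    for disk, delta in crc_deltas.items():
        if delta > 0:
            affected[disk_to_backplane[disk]].add(disk)
    return {
        backplane: sorted(disks)
        for backplane, disks in affected.items()
        if len(disks) >= min_disks
    }


# Hypothetical sample data, not taken from the incident.
sample_deltas = {"disk0": 12, "disk1": 7, "disk2": 0, "disk3": 3}
sample_topology = {"disk0": "bp0", "disk1": "bp0", "disk2": "bp0", "disk3": "bp1"}
suspects = suspect_interconnects(sample_deltas, sample_topology)
```

In the sample, two disks behind the same component report new errors, so that component is flagged. The single affected disk behind the other component is not.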
Secondary failure - Leg B. While migration of redundancy to alternative storage was in progress, Leg B suffered an independent disk disconnection. The exact root cause of this second failure is still under investigation, but the symptoms were consistent with previously observed wear-related failures within the same hardware generation and make.
Recovery complication - network identity mismatch. When the disks from Leg B were transferred into a replacement chassis, the new server presented a different storage network identity. Compute nodes in the cluster, however, retained cached connection sessions pointing to the original hardware identity. VM restart attempts therefore failed against the replacement hardware until those sessions were invalidated across every affected compute node. As the available automation tooling proved ineffective, specialized tooling had to be developed and tested during the incident.
In summary: a single-leg operation window - opened by the first hardware failure - coincided with an independent second hardware failure on the surviving leg before redundancy could be restored. The recovery path then required full hardware replacement, which surfaced inadequate automation in the storage-to-compute identity handover and significantly prolonged the recovery process.
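The identity mismatch can be illustrated with a minimal sketch that finds cached sessions still pointing at the old identity and groups them by compute node. This is not the tooling developed during the incident. The CachedSession type, its fields, and the identity strings are hypothetical, and the sketch makes no assumption about the storage protocol in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedSession:
    """A connection session cached on a compute node (hypothetical model)."""
    compute_node: str
    volume: str
    target_identity: str


def cleanup_plan(sessions, old_identity):
    """Group volumes with stale sessions by compute node.

    A session is stale if it still points at the identity of the
    replaced hardware. Sessions to other storage servers are untouched.
    """
    plan = {}
    for session in sessions:
        if session.target_identity == old_identity:
            plan.setdefault(session.compute_node, []).append(session.volume)
    return plan


# Hypothetical sample data, not taken from the incident.
sample_sessions = [
    CachedSession("node-1", "vol-a", "leg-b-old"),
    CachedSession("node-1", "vol-b", "other-storage"),
    CachedSession("node-2", "vol-c", "leg-b-old"),
]
plan = cleanup_plan(sample_sessions, old_identity="leg-b-old")
```

Every compute node that appears in the plan would need its stale sessions invalidated before VM restarts against the replacement hardware could succeed.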
What we are doing to prevent recurrence
We have grouped the remediation into completed tasks, immediate measures, and mid-term reviews to address the gaps that turned a localized hardware failure into a prolonged customer-impacting outage.
Already taken:
Immediate (June 2026):
Mid-term (within Q3):
Closing Remarks
We recognize that this incident produced a full I/O outage for the virtual machines in environments that depended on this storage pair, lasting over 15 hours, and that volumes operated without their secondary failover path for several hours before and after the incident. For any production workload, an outage and loss of redundancy of this duration is significant and unacceptable.
This incident not only resulted in service disruption and subsequent degradation, but also led to significant effort for partners and customers working on their side to mitigate the resulting outage.
We believe that the measures we have completed and have planned will effectively reduce the risk of a similar failure pattern. The planned review will help to identify and reduce risk factors further. The assessment of recovery tooling will help us recover more quickly and increase the robustness of our incident response.
Thank you for your patience during the outage and recovery, and for your engineering teams' constructive coordination and support throughout this event.