Availability of storage service partially limited in Karlsruhe

Incident Report for IONOS Cloud

Postmortem

Root Cause Analysis

Between 11 and 12 May 2026, a cluster in our FKB datacenter was affected by two consecutive storage hardware failures. These led to service degradation and an I/O outage for affected workloads spanning more than 15 hours, and limited provisioning functionality in affected VDCs.

In the following Root Cause Analysis, we explain the incident, identify the technical triggers of the outage, describe the work done during mitigation, and outline the measures we are taking to prevent a recurrence.

Technical Context

Because the failure mode was a cascading sequence of two hardware events on a redundant storage pair, we want to provide the technical context necessary to understand how a redundant system could be affected to this degree:

To provide a resilient service, storage servers in the affected clusters are deployed in redundant pairs (Leg A and Leg B) that hold synchronized data. Each leg itself has built-in redundancy and fault tolerance - for example, via RAID configurations for disks.

The two legs are designed to mitigate risk from hardware failures or issues that would render the resilience mechanisms of a single leg ineffective. Failure of one leg is largely transparent to the VM host, which - for customers - continues to operate normally.

In a "Zero Leg" event - where connectivity to both legs is lost - VMs served by the affected storage experience an outage. This is the event that caused this incident.
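
The three operating states described above can be summarized in a small model. This is a minimal sketch for illustration only: the names `PairState` and `classify_pair` are hypothetical and do not describe the actual monitoring implementation.

```python
from enum import Enum


class PairState(Enum):
    TWO_LEG = "two-leg (fully redundant)"
    SINGLE_LEG = "single-leg (no failover path left)"
    ZERO_LEG = "zero-leg (I/O outage for dependent VMs)"


def classify_pair(leg_a_reachable: bool, leg_b_reachable: bool) -> PairState:
    """Map the reachability of both legs to the state seen by a VM host."""
    reachable = int(leg_a_reachable) + int(leg_b_reachable)
    if reachable == 2:
        return PairState.TWO_LEG
    if reachable == 1:
        return PairState.SINGLE_LEG
    return PairState.ZERO_LEG


# The sequence of this incident: healthy pair, Leg A lost, then Leg B lost.
for legs in [(True, True), (False, True), (False, False)]:
    print(legs, "->", classify_pair(*legs).value)
```

Only the last state is customer-visible; the single-leg state is transparent to workloads but leaves no margin for a further failure.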

What Happened

On 11 May 2026, the first of two servers (Leg A) in a redundant storage backend pair began reporting multiple disk errors in its RAID array. The error pattern - UDMA CRC errors across several disks - was consistent with degradation of a shared physical interconnect (SAS backplane or cabling), rather than wear on any single disk. From this point onward, all dependent storage volumes were operating without their secondary failover path, served exclusively from Leg B - a single-leg event.
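
The reasoning behind that diagnosis can be sketched as a simple heuristic: CRC errors appearing on several disks behind the same backplane point to the shared path rather than to one disk. The function name, the disk and backplane labels, the error counts, and the `min_disks` threshold below are all assumptions made for illustration; they are not taken from the affected systems.

```python
from collections import defaultdict


def suspect_shared_interconnect(crc_deltas, disk_to_backplane, min_disks=2):
    """Return backplanes on which at least `min_disks` disks show new CRC errors.

    crc_deltas: disk name -> increase in CRC error count since the last check.
    disk_to_backplane: disk name -> label of the backplane it is attached to.
    """
    affected = defaultdict(list)
    for disk, delta in crc_deltas.items():
        if delta > 0:
            affected[disk_to_backplane[disk]].append(disk)
    return {
        backplane: sorted(disks)
        for backplane, disks in affected.items()
        if len(disks) >= min_disks
    }


# Hypothetical sample data: three disks on one backplane report new errors.
crc_deltas = {"sda": 14, "sdb": 9, "sdc": 0, "sdd": 3, "sde": 0}
disk_to_backplane = {
    "sda": "backplane-0",
    "sdb": "backplane-0",
    "sdc": "backplane-0",
    "sdd": "backplane-0",
    "sde": "backplane-1",
}
print(suspect_shared_interconnect(crc_deltas, disk_to_backplane))
```

A single disk with CRC errors would not be flagged by this check, which matches the distinction drawn above between a shared-path problem and wear on an individual disk.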

When a forced array reassembly was initiated, additional disks faulted during the rebuild in the afternoon, degrading the array below the recoverable threshold. After a subsequent reboot, the array came up fully inactive, and Leg A was considered unrecoverable. The decision was made to begin migrating redundancy onto alternative storage systems, with the goal of restoring two-leg operation. Due to the volume of data that needed to be synchronized, this process was expected to take several hours.

On 12 May at 05:13 UTC - while synchronization was still in progress - Leg B experienced an independent disk disconnection in its own RAID array. With Leg A already non-functional, this second failure eliminated all remaining redundancy in the pair. All dependent volumes dropped to zero available copies simultaneously, causing an I/O outage for the affected virtual machines.

Leg B was rebooted at 05:38 UTC and its array force-assembled with the remaining healthy disks, which allowed a first wave of volume recovery to complete by 08:25 UTC. Additional disk failures on the same server then materialized at 10:59 UTC, triggering a second outage. At this point, the decision was taken to perform a complete server replacement: at approximately 14:30 UTC, all disks were physically transferred to new server hardware and the RAID arrays were rebuilt on the replacement system.

The hardware swap introduced an additional recovery complication: the replacement server now carried a different storage network identity. Compute nodes across the cluster still held cached connection sessions pointing to the original hardware identity, and VM restart attempts subsequently failed. When a cleanup of stale connection mappings was performed across the affected compute nodes in the fleet, existing automation tooling proved ineffective, which significantly hampered recovery efforts. Leg B was confirmed fully operational at 20:45 UTC, and affected VMs began recovering. VMs where automatic recovery was not possible were remediated manually in the following hours.
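
The duration of the 12 May outage can be checked with a few lines of arithmetic. The sketch below uses only the exact timestamps stated in this report; the approximate 14:30 UTC hardware transfer is left out, and the event labels and helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def utc(hour: int, minute: int) -> datetime:
    """Build a timestamp on 12 May 2026 (UTC)."""
    return datetime(2026, 5, 12, hour, minute, tzinfo=timezone.utc)


def format_delta(delta: timedelta) -> str:
    """Render a timedelta as hours and minutes, e.g. '2h05m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


timeline = [
    ("Leg B disk disconnection, I/O outage begins", utc(5, 13)),
    ("Leg B rebooted", utc(5, 38)),
    ("First wave of volume recovery complete", utc(8, 25)),
    ("Additional disk failures, second outage", utc(10, 59)),
    ("Leg B confirmed fully operational", utc(20, 45)),
]

# Elapsed time between consecutive events.
for (label, start), (_, end) in zip(timeline, timeline[1:]):
    print(f"{format_delta(end - start):>7}  after: {label}")

total = timeline[-1][1] - timeline[0][1]
print("Total:", format_delta(total))  # Total: 15h32m
```

The total of 15 hours and 32 minutes is the span from the first Leg B failure to the point at which Leg B was confirmed operational.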

During the subsequent synchronization process to re-establish redundancy between the new legs, the provisioning service was deactivated in VDCs that contained VMs served by the affected storage pair. This meant customers were temporarily unable to roll out changes to affected VMs. Storage redundancy was fully restored on 15 May, once all affected volumes had been migrated back to a fully redundant configuration. At this point, the service was fully recovered.

How was that possible? (Root Cause)

The incident was caused by two independent hardware failures on the two servers of a single redundant storage pair, occurring several hours apart. Each failure on its own would not have caused customer impact. The combination of both, within the window between the first failure and the completion of redundancy migration, eliminated the redundancy that the architecture is designed to provide.

Primary failure - Leg A. The errors observed on Leg A were consistent with degradation of a shared physical interconnect - a SAS backplane or cabling issue - rather than end-of-life wear on individual disks. This pattern explains both why multiple disks failed in a correlated way, and why repair attempts could not outpace degradation: the rebuild operations themselves exercise the interconnect, and a second disk faulted under rebuild load before the array could complete. A stale array member prevented full reassembly, and the server was effectively out of service after its reboot.

Secondary failure - Leg B. While migration of redundancy to alternative storage was in progress, Leg B suffered an independent disk disconnection. The exact root cause of this second failure is still under investigation, but the symptoms were consistent with previously observed wear-related failures within the same hardware generation and make.

Recovery complication - network identity mismatch. When the disks from Leg B were transferred into a replacement chassis, the new server presented a different storage network identity. Compute nodes in the cluster, however, retained cached connection sessions pointing to the original hardware identity. VM restart attempts therefore failed against the replacement hardware until those sessions were invalidated across every affected compute node. Because the available automation tooling proved ineffective, specialized tooling had to be developed and tested during the incident.
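
Conceptually, the cleanup amounts to finding every cached session that still points at the old identity and repointing it. The following is a simplified in-memory sketch under stated assumptions: the data layout, the identity strings, and the function name are hypothetical, and it does not represent the tooling that was developed during the incident.

```python
def refresh_stale_sessions(node_sessions, old_identity, new_identity):
    """Repoint cached sessions from the old storage identity to the new one.

    node_sessions: compute node name -> {volume name -> storage identity}.
    Returns the number of refreshed sessions per node (nodes with none omitted).
    """
    refreshed = {}
    for node, sessions in node_sessions.items():
        count = 0
        for volume, identity in sessions.items():
            if identity == old_identity:
                # Only the value of an existing key changes, so this is
                # safe while iterating over the dictionary.
                sessions[volume] = new_identity
                count += 1
        if count:
            refreshed[node] = count
    return refreshed


# Hypothetical cluster state after the hardware swap.
node_sessions = {
    "compute-01": {"vol-a": "leg-b-old", "vol-b": "other-storage"},
    "compute-02": {"vol-c": "leg-b-old", "vol-d": "leg-b-old"},
    "compute-03": {"vol-e": "other-storage"},
}
print(refresh_stale_sessions(node_sessions, "leg-b-old", "leg-b-new"))
```

Returning a per-node count makes it possible to verify that every affected compute node was actually touched, which is the property that mattered for VM restarts to succeed.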

In summary: a single-leg operation window - opened by the first hardware failure - coincided with an independent second hardware failure on the surviving leg before redundancy could be restored. The recovery path then required full hardware replacement, which surfaced inadequate automation in the storage-to-compute identity handover. This gap prolonged the recovery process significantly.

What we are doing to prevent recurrence

We have grouped the remediation into completed tasks, immediate measures, and mid-term reviews to address the gaps that turned a localized hardware failure into a prolonged customer-impacting outage.

Already taken:

Complete server replacement: All disks from the affected servers were physically transferred to verified replacement hardware, with RAID arrays fully rebuilt and verified on the new platform.
Disk replacement and array verification: All involved disks were replaced; arrays were rebuilt and integrity-verified before being returned to service.

Immediate (June 2026):

Automated session migration on hardware replacement: Update and validate tooling that automatically refreshes storage network identity mappings across all compute nodes when a storage server of this configuration is replaced in the manner required during this incident. This directly addresses the cleanup step that extended recovery by several hours.
Single-leg exposure review: Tighten the operational window during which a degraded redundant pair is allowed to operate on a single leg, for configurations in locations where Dynamic Leg Swapping is not available. Dynamic Leg Swapping refers to a process in which a single-leg failure initiates an automated sync to alternate storage. Repair on the affected leg can continue in the meantime, and whichever redundancy recovery strategy (repair or migration) finishes first provides the second leg. Tightening the window bounds the time spent on repair efforts, simplifies decision-making during incident response, and shrinks the window in which a subsequent hardware failure can directly impact connected workloads.
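
The selection rule behind Dynamic Leg Swapping, as described above, can be sketched as picking whichever recovery path completes first. This is a minimal illustration: the `RecoveryOption` type, the function name, and the hour estimates are assumptions, not measured values or the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    estimated_hours: float


def choose_recovery(repair: RecoveryOption, migration: RecoveryOption) -> RecoveryOption:
    """Pick the option expected to restore two-leg operation first.

    On a tie, `min` returns the first argument, so repair is preferred.
    """
    return min((repair, migration), key=lambda option: option.estimated_hours)


# Hypothetical estimates for a degraded pair.
repair = RecoveryOption("repair affected leg", estimated_hours=9.0)
migration = RecoveryOption("sync to alternate storage", estimated_hours=6.5)

chosen = choose_recovery(repair, migration)
print(f"Chosen: {chosen.name}; single-leg window bounded at {chosen.estimated_hours} h")
```

Because both paths can proceed at the same time, the single-leg exposure window is bounded by the faster of the two rather than by the repair effort alone.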

Mid-term (within Q3):

Dynamic Leg Swapping Rollout: We will assess the technical prerequisites to retrofit locations to support Dynamic Leg Swapping or provide alternative strategies to shorten (automated) redundancy restoration in data centers.
Recovery Tool Assessment: The incident has shown that existing recovery tooling needs to be reviewed and validated to work against all existing hardware and configuration combinations in the fleet.
Hardware generation review: Comprehensive review of the affected storage hardware generation, covering firmware versions, backplane integrity, and environmental conditions. Failure patterns will be evaluated to rule out problematic combinations. Components showing early signs of degradation will be proactively replaced or upgraded.

Closing Remarks

We recognize that this incident produced a full I/O outage for the virtual machines in environments that depended on this storage pair, lasting over 15 hours, and that volumes operated without their secondary failover path for several hours before and after the incident. For any production workload, an outage and loss of redundancy of this duration is significant and unacceptable.

This incident not only resulted in service disruption and subsequent degradation, but also led to significant effort for partners and customers working on their side to mitigate the resulting outage.

We believe that the measures we have completed and have planned will effectively reduce the risk of a similar failure pattern. The planned review will help to identify and reduce risk factors further. The assessment of recovery tooling will help us recover more quickly and increase the robustness of our incident response.

Thank you for your patience during the outage and recovery, and for your engineering teams' constructive coordination and support throughout this event.

Posted May 21, 2026 - 11:54 UTC

Resolved

As redundancy has been fully restored, we are resolving this incident. An RCA is being prepared and we estimate that it will be shared here today or tomorrow.

Posted May 18, 2026 - 07:38 UTC

Monitoring

Our team has recovered the remaining VMs. It’s important to note that storage redundancy is not yet fully restored. Due to necessary data synchronization processes and the volume of data involved, full recovery will take several additional hours.

Important notice for affected customers:
- Load: To ensure stability of the recovered VMs, please try to avoid excessive storage load where possible.
- Provisioning: Provisioning jobs against affected VMs will not be possible until redundancy is fully restored. Consequently, making state changes will not be possible during this time.
- VM States: During the incident, the state of affected VMs may have changed; some may currently be in "shutoff" state.
- Manual Intervention: To avoid unintended impact on customers for whom "shutoff" is the desired state, we cannot automatically restart all VMs without approval.

Due to these constraints, we kindly ask affected customers to carefully check their resources and request state changes via support ticket (support@cloud.ionos.com) where necessary. This is only needed until redundancy is restored.

We are moving this incident to Monitoring status while work to restore full redundancy continues. We will provide an update here once that work is complete. A detailed Root Cause Analysis (RCA) will be published subsequently, explaining the cause of the incident, the measures taken to restore service, and the steps implemented to prevent recurrence.

We thank our customers and partners for their patience and cooperation in ensuring the full recovery of affected services and workloads.

Posted May 12, 2026 - 23:06 UTC

Update

We have successfully recovered additional VMs. However, because full storage redundancy has not yet been reached for these systems, Storage Provisioning Jobs will currently fail. We ask affected customers to refrain from making configuration changes or placing excessive load on these servers until redundancy is fully restored.

Posted May 12, 2026 - 21:08 UTC

Update

The team has succeeded in restoring the first VMs during the initial run. We are currently working on rolling out the restoration process further and will update on the progress.

Posted May 12, 2026 - 20:12 UTC

Update

Our development team is working on re-establishing connectivity sessions of the affected VMs so that the recovery effort can continue. The first runs are currently starting.

Posted May 12, 2026 - 20:00 UTC

Update

The team has encountered a blocker while restarting the remaining VMs due to connectivity issues between the replaced hosts and the storage servers. This has significantly increased the complexity of the recovery effort. We are currently collaborating with a specialized development team to bring the remaining systems back online. We will provide the next update at 20:00 UTC or before if there are meaningful updates.

Posted May 12, 2026 - 19:23 UTC

Update

We have made progress regarding the recovery, and are currently in the process of restarting affected Virtual Machines.

Posted May 12, 2026 - 16:01 UTC

Update

The spares have been made available. They are currently being brought online to allow recovery to continue. We estimate that we will have another update on the progress within the hour.

Posted May 12, 2026 - 13:53 UTC

Update

We found relevant hardware to be defective; it has to be replaced in the data center first.
After the physical connection is back, we need to sync and update the storage servers to get the VMs back online.
We are working as fast as we can to get this fixed; however, this might take some hours.

Posted May 12, 2026 - 12:49 UTC

Identified

Our storage team is currently working on a firmware restore and upgrade on the affected storage servers and will then attempt to bring them back into service.

Posted May 12, 2026 - 11:23 UTC

Investigating

Unfortunately, we ran into further issues with our Storages in Karlsruhe, and the team is continuing repair work.
Affected VMs (and VDCs) are not available and have to be restarted once they are back online.
We will share new information as soon as it is available.

Posted May 12, 2026 - 11:04 UTC

Monitoring

All VMs, Storages, and VDCs are back online and available.
We are continuing to monitor the situation.

Posted May 12, 2026 - 08:59 UTC

Identified

Our developers have restored storage availability; we are now restarting the affected VMs to bring them back online.

Posted May 12, 2026 - 08:02 UTC

Investigating

We are currently experiencing restrictions regarding our Storage service in one cluster in our Karlsruhe region. We are working to restore regular operation of our services as quickly as possible.

Posted May 12, 2026 - 06:15 UTC

This incident affected: Location DE/FKB (Storage).