Storage Service Degradation

Incident Report for IONOS Cloud

Postmortem

This is our Root Cause Analysis. We will update the RCA as we complete our investigation, inventory, and planning of repair items.

Update 19.11.2025 - Expanding on RTRS connection loss as contributing factor

Update 11.12.2025 - Adding repair items related to network upgrades and their ETAs

What Happened?

A defective Host Channel Adapter (HCA) caused instability in the InfiniBand fabric starting at 2025-11-17 22:08:37 UTC. While the initial disruption was relatively brief, the resulting instabilities (frequent, simultaneous reconnect attempts within a short period of time) severed storage connections and triggered a bug in a software package used on our storage servers. The bug caused an exception that rebooted several storage servers at the same time. The combination of severed connections and simultaneous reboots led to a loss of storage connectivity for VMs in the affected cluster.
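
As background on why a burst of simultaneous reconnect attempts can overwhelm a storage backend, the sketch below shows a generic reconnect loop with exponential backoff and full jitter, the usual pattern for spreading such attempts out over time. This is a hypothetical illustration, not the RTRS client code involved in this incident; the function name and parameters are assumptions.

```python
import random
import time

def reconnect_with_backoff(connect, max_attempts=8, base_delay=0.5, max_delay=30.0):
    """Generic reconnect loop with exponential backoff and full jitter.

    `connect` stands in for whatever re-establishes a transport session.
    Randomized delays keep clients that lost their sessions at the same
    moment from all retrying in the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Exponential backoff capped at max_delay, with full jitter.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```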

How was that possible?

While the initial network disruption was relatively brief, it severed RTRS (RDMA (Remote Direct Memory Access) Transport) connections to some storage servers and triggered a bug in a software package used on our storage servers. This bug caused several storage servers to reboot unexpectedly, which decreased redundancy and increased the likelihood of volumes becoming unavailable to VMs. The HCA hardware fault, combined with the severed RTRS sessions and the unexpected reboots of several storage servers, led to a loss of redundancy and connectivity and has been identified as the technical root cause of this incident. While the infrastructure could be recovered relatively quickly, recovering the affected VMs and services took longer than expected. We are currently investigating the exact reason for the extended recovery duration.

How did we respond?

The InfiniBand ports of the problematic server were shut down, successfully isolating the defective HCA, and the fabric topologies reconverged quickly. The resulting storage connectivity issues then had to be addressed by re-attaching storage, re-establishing storage redundancy, and reprovisioning or resetting the affected VMs.
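
As an illustration of how a misbehaving HCA port can be spotted on a Linux host, the sketch below reads the standard InfiniBand sysfs entries and reports each port's state along with a few error counters. The counter selection and the use of sysfs are assumptions about a typical Linux InfiniBand setup; this report does not describe the tooling actually used to identify and isolate the defective adapter.

```python
from pathlib import Path

# Standard InfiniBand sysfs layout on a typical Linux host; which counters
# are worth watching is an assumption for illustration, not a detail taken
# from this incident report.
IB_ROOT = Path("/sys/class/infiniband")
ERROR_COUNTERS = ("symbol_error", "link_error_recovery", "port_rcv_errors")

def port_health():
    """Yield (device, port, state, {counter: value}) for every local IB port."""
    for dev in sorted(IB_ROOT.iterdir()):
        for port in sorted((dev / "ports").iterdir()):
            state = (port / "state").read_text().strip()  # e.g. "4: ACTIVE"
            counters = {}
            for name in ERROR_COUNTERS:
                counter_file = port / "counters" / name
                if counter_file.exists():
                    counters[name] = int(counter_file.read_text())
            yield dev.name, port.name, state, counters

if __name__ == "__main__":
    for dev, port, state, counters in port_health():
        errors = {name: value for name, value in counters.items() if value > 0}
        suffix = f"  errors={errors}" if errors else ""
        print(f"{dev} port {port}: {state}{suffix}")
```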

What are we doing to prevent this from happening again?

While the investigation is still ongoing, we have already inventoried the following actions. We will assign ETAs to them as soon as investigation and planning are complete.

  • Hardware & Fabric Resilience

    • HCA Replacement: Replace the defective HCA and thoroughly test the new component. (Done)
    • InfiniBand Migration: (End of February 2026)
    • Control Plane Hardening and SDN/VNF Updates (ongoing - End of December 2025)
  • Storage Stability

    • Bug Mitigation: Investigate and fix the software package bug to prevent storage server reboots triggered by similar incidents. (Done - In Rollout)
    • Resilience Review: Implement measures to prevent this specific cascading failure in the future. (Ongoing)
  • Improved Recovery

    • Improve tooling and procedures to allow for swifter recovery, specifically from storage connectivity loss at scale. (Ongoing)

As we continue our investigation, we will update and add to this Root Cause Analysis. We would like to thank you for your patience and teamwork over the past hours.

Posted Nov 18, 2025 - 22:57 UTC

Resolved

After monitoring for and responding to residual issues, we are marking this incident as resolved and will publish our preliminary Root Cause Analysis.
Posted Nov 18, 2025 - 22:33 UTC

Update

We want to share another note on the recovery efforts. While the incident is resolved at the infrastructure level, we recommend checking the status of virtual machines and resources deployed in your environment.
Depending on how operating systems and/or applications handled the loss of storage connectivity, crashes or freezes may have occurred that require manual intervention (e.g., a restart) to resolve.

Unfortunately, we cannot perform this action for you and want to make you aware of this important check to make sure your environment is fully operational after this incident.
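
For Linux guests, a minimal sketch of one such check is shown below: it lists block-device filesystems that are currently mounted read-only, a common symptom after prolonged storage I/O errors when ext4 or XFS protects itself by remounting or shutting down. This is an illustrative example for a standard Linux guest, not an official IONOS recovery tool, and it does not cover every possible failure mode.

```python
# Sketch of a quick post-incident check inside a Linux guest: list filesystems
# backed by block devices that are currently mounted read-only. Illustrative
# only; applications may need their own checks, and a reboot or fsck may still
# be required.
def read_only_mounts(mounts_file="/proc/mounts"):
    suspicious = []
    with open(mounts_file) as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if device.startswith("/dev/") and "ro" in options.split(","):
                suspicious.append((device, mountpoint, fstype))
    return suspicious

if __name__ == "__main__":
    hits = read_only_mounts()
    if hits:
        print("Filesystems currently mounted read-only (may need attention):")
        for device, mountpoint, fstype in hits:
            print(f"  {device} on {mountpoint} ({fstype})")
    else:
        print("No read-only block-device mounts found.")
```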

The team is currently working on compiling the Root Cause Analysis. We expect that we will have a preliminary version to share with you today.
Posted Nov 18, 2025 - 01:43 UTC

Monitoring

All remaining volumes have been successfully recovered, and redundancy is being re-established for the last volumes as well. We have also seen services recover. We are closely monitoring the situation and are completing the final tasks to close the incident.
We have started investigating the root cause of the incident and will publish the analysis here.

We want to thank you for your patience thus far and commit to a transparent and thorough investigation.
Posted Nov 18, 2025 - 01:01 UTC

Update

We believe we have identified a solution for the volumes that are still unavailable (approx. 10%) and are currently rolling out the fix to them. In parallel, we are working to restore redundancy to the already-recovered volumes and are making good progress there as well.
Posted Nov 18, 2025 - 00:27 UTC

Update

The recovery of the affected volumes has progressed well overall. For a subset of volumes, recovery is taking longer than expected. We are currently working on restoring these remaining volumes and will then restore redundancy.
Posted Nov 18, 2025 - 00:11 UTC

Update

As recovery efforts progress, we are beginning to see improvements in service availability. While we are still actively mitigating the incident, we want to share a preliminary, high-level overview of the situation:
We have identified a machine that caused a widespread network disruption in FRA. This disruption cascaded, degrading the performance of a network that provided connectivity to a storage server. The resulting loss of connectivity led to service disruptions and degradation for compute resources in the datacenter.

We will keep you updated about the progress of the recovery as well as insights into the root cause of the incident.
Posted Nov 17, 2025 - 23:44 UTC

Update

We have updated the scope of the incident. A recovery plan has been put together, and recovery efforts are already underway. We are monitoring the process closely and will share more details as they become available. Another update will follow by 00:00 UTC at the latest.
Posted Nov 17, 2025 - 23:31 UTC

Update

We are continuing to work on a fix for this issue.
Posted Nov 17, 2025 - 23:27 UTC

Identified

We are currently recovering from a storage server outage in the datacenter and are migrating the affected storage volumes/VMs away from the host. We will provide another status update before 23:30 UTC.
Posted Nov 17, 2025 - 23:14 UTC

Investigating

We are currently investigating a storage issue.
Posted Nov 17, 2025 - 22:55 UTC
This incident affected: Location DE/FRA (Compute, Storage, Network).