Networking Issue in TXL

Incident Report for IONOS Cloud

Postmortem

Update 18.11.2025 - After further investigation we want to update our preliminary Root Cause Analysis with additional insights

Update 24.11.2025 - We want to provide more concrete timelines for the action items

What happened:

On 14.11.2025, between 13:54 and 14:47 UTC, we observed a network connectivity issue in our TXL datacenter, affecting multiple services and leading to increased latency and packet loss.

This happened because:

A datacenter cluster experienced network instability due to broken failover behavior in the control plane of our InfiniBand (IB) fabrics. The core issue was constant switching of the Master/Standby roles between the redundant InfiniBand Subnet Managers, which led to inconsistencies in the fabric.

The root cause of this instability was identified as overly aggressive timeout and retry settings in the Subnet Manager’s High Availability (HA) configuration. These sensitive settings allowed a minor, otherwise insignificant latency spike to trigger an initial failover.
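
To illustrate the mechanism, the sketch below shows how a polling timeout and a retry count combine into a failover detection window, and how a window shorter than a transient latency spike makes a healthy master look unreachable. All parameter names and values are illustrative assumptions; they do not describe the actual Subnet Manager software or configuration in use.

    # Hypothetical illustration of how HA polling settings define a failover
    # detection window. Parameter names and values are assumptions and do not
    # describe the actual Subnet Manager software or configuration in use.

    def detection_window_ms(polling_timeout_ms: int, retry_count: int) -> int:
        """Time after which a standby declares the master unreachable."""
        return polling_timeout_ms * retry_count

    # Aggressive settings: the standby gives up on the master after ~3 seconds.
    aggressive = detection_window_ms(polling_timeout_ms=1_000, retry_count=3)

    # Conservative settings: the master has ~40 seconds to respond.
    conservative = detection_window_ms(polling_timeout_ms=10_000, retry_count=4)

    latency_spike_ms = 5_000  # a brief, otherwise harmless delay in SM responses

    print(f"aggressive window:   {aggressive} ms -> failover: {latency_spike_ms > aggressive}")
    print(f"conservative window: {conservative} ms -> failover: {latency_spike_ms > conservative}")

With the aggressive values in this sketch, a five-second spike already exceeds the three-second window and triggers a failover, while the conservative window absorbs it.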

This created a self-perpetuating cycle: the resource overhead from the initial failover caused a spike in CPU load on the IB gateways. The increased load resulted in further latency, which the aggressive HA settings again interpreted as a failure, immediately triggering another failover. The loop continued until the Subnet Manager election process stabilized, allowing the system to recover automatically.
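
The dynamics of such a loop can be sketched with a small, purely hypothetical model: each failover adds gateway load that shows up as additional latency, and the election only settles once the master responds within the detection window again. None of the numbers below are measurements from our fabric.

    # Hypothetical model of the self-perpetuating failover loop. All names and
    # numbers are illustrative assumptions, not measurements from our fabric.

    def failovers_until_stable(detection_window_ms: float,
                               initial_spike_ms: float = 5_000,
                               failover_overhead_ms: float = 1_000,
                               decay: float = 0.5,
                               max_steps: int = 50) -> int:
        """Count failovers until the master again answers within the window."""
        latency = initial_spike_ms
        failovers = 0
        for _ in range(max_steps):
            if latency <= detection_window_ms:
                break  # the master responds in time; the election settles
            failovers += 1
            # Each failover costs CPU on the IB gateways; that load shows up
            # as extra latency, while the original spike slowly decays.
            latency = latency * decay + failover_overhead_ms
        return failovers

    print("aggressive window (3 s):   ", failovers_until_stable(3_000), "failovers")
    print("conservative window (40 s):", failovers_until_stable(40_000), "failovers")

In this toy model the conservative window absorbs the original spike outright, while the aggressive window only settles once the self-inflicted load has decayed below the threshold, mirroring the automatic recovery described above.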

What we are doing to prevent this from happening again:

  1. Further investigation into the Subnet Manager log messages to identify the trigger of the unexpected switching behavior (completed)
  2. Hardening and improvement of the Subnet Manager High Availability (HA) configuration (first sites within Q4)
  3. Migration of the Subnet Managers to dedicated servers that are not part of the data plane (first sites within Q4)

We understand that the incident has had a negative impact on our customers and partners. We believe that these steps will make our setup more resilient and help avoid a similar incident in the future.

Posted Nov 15, 2025 - 13:21 UTC

Resolved

All pending systems have reported back as healthy, and we are marking this incident as resolved. We will publish the Root Cause Analysis (RCA) here as soon as it becomes available. Thank you for your patience.
Posted Nov 14, 2025 - 16:13 UTC

Monitoring

We are seeing further improvement across the environment. We are actively monitoring the recovery and working to confirm the root cause of the incident; we will publish the results here.
Posted Nov 14, 2025 - 15:28 UTC

Update

We are observing some improvements in our monitoring metrics after our recent intervention, and some previously affected systems are reporting back as healthy. We will continue to monitor the environment closely as we look for any other potential contributing factors. We will provide our next update no later than 15:30 UTC.
Posted Nov 14, 2025 - 15:02 UTC

Identified

We have identified and disabled a defective switch that we believe was a potential culprit. We are now closely monitoring the environment to assess the impact of this action, as we suspect there may be additional contributing factors. Our investigation continues, and we will provide the next update by 15:30 UTC at the latest.
Posted Nov 14, 2025 - 14:56 UTC

Update

We are currently investigating instances of Border Gateway Protocol (BGP) session flapping in the TXL datacenter. Our network experts are working to pinpoint the root cause of the issue. You can expect the next update here at 15:00 UTC.
Posted Nov 14, 2025 - 14:33 UTC

Investigating

We are currently investigating reports related to connectivity issues in our TXL datacenter. We will update you as soon as possible.
Posted Nov 14, 2025 - 14:21 UTC
This incident affected: Location DE/TXL (Network).