Network service degraded

Incident Report for IONOS Cloud

Postmortem

We have finished compiling the Root Cause Analysis (RCA) for this incident.

What happened

On 2026-01-13, our Cluster 601 in TXL (Berlin) experienced a network instability event. During this time, servers in the cluster experienced frequent flapping of their BGP sessions with the gateway/core routers. This caused periods of connectivity loss lasting between 5 and 30 seconds for each affected server during the flaps.

How could this happen

The incident was triggered during a planned maintenance activity on one of the cluster's two network topologies (topo 1). We expected the maintenance to be transparent to our customers.

Root Cause: The maintenance activity led to an unexpected CPU spike in the Fabric Manager process, the control plane of the fabric, on the second, healthy network topology (topo 2). This spike starved the BGP routing process on the gateway of CPU resources, causing the BGP sessions to fail and flap.
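
To make the mechanism concrete, here is a minimal sketch, not our actual stack: BGP peers exchange periodic keepalive messages, and a session is torn down once no keepalive is processed within the negotiated hold time. The timer values below are common defaults assumed for illustration, not the values configured on the affected gateways.

    # Minimal sketch: BGP hold-timer expiry under CPU starvation.
    # Timer values are assumed common defaults, not our gateway settings.
    KEEPALIVE_INTERVAL = 30.0  # seconds, typically hold_time / 3
    HOLD_TIME = 90.0           # seconds, negotiated per session

    def session_survives(stall_seconds):
        """A scheduling stall shorter than the hold time leaves the session
        up; a longer stall expires the hold timer and the session flaps."""
        return stall_seconds < HOLD_TIME

    for stall in (10, 45, 95, 180):
        state = "stays up" if session_survives(stall) else "flaps (hold timer expired)"
        print(f"{stall}s CPU stall -> session {state}")

Each flap then forces the session to re-establish and re-converge, which corresponds to the 5 to 30 second connectivity gaps described above.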

Cross-Topology Impact: The network maintenance activity on one side of our system unexpectedly spilled over and impacted the second, healthy side. This happened because the core operating system and network software did not treat the two network segments as fully independent. When the first segment was shut down, the system triggered a global "cleanup" process that briefly overloaded the healthy segment's fabric manager, which in turn led to the routing failures (BGP flaps).

In short, a procedure intended for one part of the network had an unforeseen side effect on the other, healthy part due to shared dependencies in the underlying software.
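
A purely illustrative sketch of this failure mode follows; it is assumed pseudocode, not our actual software. The difference is a teardown routine scoped globally versus one scoped only to the topology under maintenance.

    # Illustrative anti-pattern (assumed, not our actual code): teardown
    # scoped across all topologies instead of only the one being shut down.
    TOPOLOGIES = {"topo1": ["switch-a", "switch-b"],
                  "topo2": ["switch-c", "switch-d"]}

    def cleanup_all():
        # Anti-pattern: shutting down topo 1 re-syncs state everywhere,
        # putting load on the healthy topo 2 fabric manager as well.
        for topo, switches in TOPOLOGIES.items():
            for switch in switches:
                print(f"re-syncing state for {switch} in {topo}")

    def cleanup(topo):
        # Scoped variant: only the topology under maintenance is touched.
        for switch in TOPOLOGIES[topo]:
            print(f"re-syncing state for {switch} in {topo}")

    cleanup("topo1")  # maintenance on topo 1 leaves topo 2 untouched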

What are we doing to prevent this from happening again

We have identified several long-term remediation actions to prevent this specific issue from recurring:

  • Fabric Manager Optimization: We are evaluating and implementing specific tweaks to the Fabric Manager configuration. This update is already deployed in the US, with the TXL (Berlin) cluster planned for calendar week 3 and the rest of the EU for calendar week 4. These changes build on the improvements already deployed in Q4 last year targeting the core switches and will also help harden the IB gateway component against overloads. (To be completed in January)
  • Dedicated Resources: We are in the process of migrating the fabric manager to dedicated servers to isolate its operation from the BGP routing process, ensuring that an issue in one cannot directly impact the other; a sketch of the isolation principle follows this list. (To be completed in February)
  • Topology Maintenance Improvements: We have identified a critical command in the maintenance script that made the spillover effect more likely. We have revised the runbook for similar maintenance so that this command is no longer run. (DONE)
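
For illustration only, here is a minimal sketch of the isolation principle, assuming a Linux control-plane host; the actual remediation separates the processes onto dedicated hardware. The core assignments and the use of CPU affinity are assumptions for this sketch, not our production configuration.

    # Minimal sketch (Python, Linux only): keep the routing process and the
    # fabric manager on disjoint CPU cores so a spike in one cannot starve
    # the other. Core numbers are assumed for illustration.
    import os

    ROUTING_CPUS = {0, 1}  # assumed cores reserved for the BGP routing process
    FABRIC_CPUS = {2, 3}   # assumed cores reserved for the fabric manager

    def pin(pid, cpus):
        """Restrict a process to the given CPU cores."""
        os.sched_setaffinity(pid, cpus)
        print(f"pid {pid} limited to CPUs {sorted(os.sched_getaffinity(pid))}")

    pin(0, ROUTING_CPUS)  # pid 0 means "the calling process"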
Posted Jan 15, 2026 - 14:41 UTC

Resolved

The issue is now confirmed to be resolved. The investigation into the root cause has begun, and the RCA will be published here.
Posted Jan 13, 2026 - 13:54 UTC

Monitoring

All nodes have now been updated and the fix has been implemented. Our team is continuing to monitor the situation, but all customer connectivity is back to normal. The RCA will be published as soon as our team has completed its analysis.
Posted Jan 13, 2026 - 12:23 UTC

Update

A fix has been implemented, and customers should see traffic and connectivity return to normal at this point.
Posted Jan 13, 2026 - 12:09 UTC

Update

The team has started to implement a fix and is now verifying whether it is sufficient. Many customers should already be seeing improvements.
Posted Jan 13, 2026 - 11:47 UTC

Identified

The team has now identified the source and is investigating remediation steps.
Posted Jan 13, 2026 - 11:27 UTC

Update

Our network team is continuing to investigate the network issue. They believe they have identified the source and are working on a remediation plan. More information to come.
Posted Jan 13, 2026 - 11:23 UTC

Update

We are continuing to investigate this issue.
Posted Jan 13, 2026 - 11:03 UTC

Investigating

We are writing to inform you that we have been experiencing sporadic connection issues and substantial delays in packet delivery.

Network technicians began working on the issue immediately after detection and will isolate and resolve the problem as quickly as possible. However, connection quality for individual virtual resources may be degraded in the meantime.

We will inform you as soon as the functionality has been restored.
Posted Jan 13, 2026 - 10:10 UTC
This incident affected: Location DE/TXL (Network).