Network Connectivity Issues in TXL

Incident Report for IONOS Cloud

Postmortem

What happened?

A network control-plane failure in the TXL data center caused progressive service degradation and a partial outage for one cluster. The incident resulted in intermittent connectivity loss for a subset of customers, with traffic impact ranging from 5% to 20% during three distinct intervals on 11.03.2026:

  • 05:20 – 05:25 UTC
  • 11:58 – 12:10 UTC
  • 17:47 – 18:08 UTC

The issue was resolved once the affected network devices had been sequentially rebooted and additional configuration changes had been applied. The incidents prompted a series of emergency maintenance windows to stabilize the cluster.

How was this possible? (Root Cause)

The underlying cause was the exhaustion of multicast forwarding resources on the switches serving the affected topology. The trigger is technically identical to that of the previous incident on 03.03.2026.

When the forwarding table reaches its maximum capacity, the network control plane can no longer program the required updates into the switches. The resulting continuous retry load overwhelms the system, leading to severe CPU overload and an out-of-memory (OOM) crash of the control-plane process. Without this process, the fabric is unable to maintain stable forwarding, ultimately causing BGP session flaps and the observed loss of connectivity.
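The failure mode can be illustrated with a small, simplified model (a hypothetical sketch, not the actual control-plane code): once the forwarding table is full, every update attempt fails and is requeued, so each sweep burns CPU on retries while the pending-update backlog, and with it the process's memory footprint, grows without bound.

```python
from collections import deque

TABLE_CAPACITY = 4          # illustrative; real switches hold many thousands of entries

forwarding_table = set()
pending_updates = deque()   # updates the control plane still has to program

def program_update(entry):
    """Try to push one forwarding entry into the switch's table."""
    if len(forwarding_table) >= TABLE_CAPACITY:
        return False        # table exhausted: the hardware rejects the entry
    forwarding_table.add(entry)
    return True

def control_plane_sweep(new_entries):
    """One sweep: retry all queued updates, then the newly arrived ones."""
    pending_updates.extend(new_entries)
    for _ in range(len(pending_updates)):
        entry = pending_updates.popleft()
        if not program_update(entry):
            pending_updates.append(entry)   # failed update is requeued; backlog grows

# Simulate sweeps that keep adding multicast groups beyond table capacity.
for sweep in range(3):
    control_plane_sweep([f"mcast-group-{sweep}-{i}" for i in range(3)])

# Once capacity is hit the backlog never drains: each sweep spends CPU on
# retries and the queue (i.e. memory) keeps growing -- the precursor to OOM.
print(len(forwarding_table), len(pending_updates))
```

After three sweeps the table is pinned at capacity while the retry backlog has already outgrown a single sweep's worth of new entries, which is exactly the runaway-load pattern described above.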

What are we doing to prevent recurrence?

As part of the measures established following the 03.03.2026 incident, we had already rolled out configuration changes to one of the two topologies in the affected cluster. These measures aim to significantly reduce the load on the control plane by optimizing resweep and failover times. Due to existing instabilities in the second topology, these changes had not yet been applied there.

In response to the degradations observed on 11.03.2026, we accelerated the rollout of these optimizations via Emergency Maintenance across all TXL topologies and several other data centers yesterday. Post-implementation, we observed a significant positive impact, with OpenSM loads substantially reduced. Furthermore, we have reconfigured management services to cut the number of multicast forwarding entries, resulting in a markedly lower base load on the control-plane process.

Immediate Technical Actions:

  • Resource Management: Adjusted control-plane configurations to use less aggressive failover timers and reduced heavy sweep activity, lowering the load on the forwarding table. (DONE)
  • Multicast Load Reduction: Implemented measures to decrease the number of multicast group memberships, further reducing pressure on the forwarding table. (DONE)
  • Proactive Monitoring: Established continuous monitoring of forwarding table utilization and control-plane memory to trigger early alerts before capacity limits are reached. (DONE)

Long-term Structural Improvements:

  • Control-Plane Isolation: We are migrating our control-plane to dedicated, high-performance servers. This work, which began in Q4 2025, ensures the separation of network management from data-plane traffic to eliminate resource competition. We are also performing a deep-dive audit of IPv6 configurations and IPoIB driver settings. (Ongoing, ETA: Q2 2026)
  • Network Modernization: Our interconnect fabric is undergoing a strategic modernization program—including switches, gateways, and drivers—to increase both resiliency and performance. (Ongoing, ETA: Q3 2026)

With these measures now in place, we are confident the immediate cause of the cluster instability has been mitigated. Our long-term strategy will further enhance network stability and scalability across all data centers while addressing the identified bottlenecks.

We recognize that these disruptions have impacted our customers and partners, and we sincerely appreciate your patience regarding the short-notice emergency maintenance announced yesterday. We are continuing to monitor the cluster closely to ensure all deployed fixes remain effective.

Posted Mar 13, 2026 - 10:42 UTC

Resolved

We are marking this incident as resolved. Our Network Team will publish an RCA here once it is compiled.
Posted Mar 11, 2026 - 21:19 UTC

Monitoring

We are placing the incident in a monitoring state. Our Network Team is closely monitoring the cluster and working to restore full redundancy.
Posted Mar 11, 2026 - 19:27 UTC

Update

The deployed change has had a positive effect. We are downgrading the impact level while continuing to monitor the cluster closely.
Posted Mar 11, 2026 - 18:54 UTC

Identified

In response to monitoring alerts, our network team deployed a change to stabilize the network in the affected cluster. We will post another update at 19:00 UTC—or as soon as new information becomes available.
Posted Mar 11, 2026 - 18:31 UTC

Investigating

We are currently investigating network connectivity issues in our TXL data center.
Posted Mar 11, 2026 - 18:13 UTC
This incident affected: Location DE/TXL (Network).