What happened?
A network control-plane failure in the TXL data center caused progressive service degradation and a partial outage for one cluster. The incident resulted in intermittent connectivity loss for a subset of customers, with traffic impact ranging from 5% to 20% during three distinct intervals on 11.03.2026:
The issue was resolved once the affected network devices were rebooted sequentially and additional configuration changes were applied. These events prompted a series of emergency maintenance windows to stabilize the cluster.
How was this possible? (Root Cause)
The underlying cause was the exhaustion of multicast forwarding resources on the switches serving the affected topology. This trigger is technically identical to that of the previous incident on 03.03.2026.
When the multicast forwarding table reaches its maximum capacity, the network control plane can no longer program the required updates into the switches' forwarding tables. The continuous, failing attempts to push these updates overwhelm the control-plane process, leading to severe CPU overload and ultimately an out-of-memory (OOM) crash. Without this process, the fabric is unable to maintain stable forwarding, which results in BGP session flaps and the observed loss of connectivity.
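To make this failure chain concrete, the sketch below shows a minimal headroom check of the kind that can flag multicast forwarding table exhaustion before the control-plane process is driven into CPU overload and an OOM crash. The switch names, capacities, and thresholds are illustrative assumptions, not values from the affected cluster.

```python
# Illustrative sketch only: metric values and thresholds are assumptions,
# not readings from the incident. It flags switches whose multicast
# forwarding table (MFT) is close to its hardware limit.

from dataclasses import dataclass

@dataclass
class MftSample:
    switch: str
    used_entries: int   # multicast forwarding entries currently programmed
    capacity: int       # hardware table limit of the switch

def mft_alerts(samples, warn_ratio=0.80, crit_ratio=0.95):
    """Return (severity, switch, utilisation) tuples for switches near exhaustion."""
    alerts = []
    for s in samples:
        utilisation = s.used_entries / s.capacity
        if utilisation >= crit_ratio:
            alerts.append(("CRITICAL", s.switch, utilisation))
        elif utilisation >= warn_ratio:
            alerts.append(("WARNING", s.switch, utilisation))
    return alerts

if __name__ == "__main__":
    # Hypothetical readings; in practice these would come from fabric telemetry.
    samples = [
        MftSample("leaf-01", used_entries=3900, capacity=4096),
        MftSample("leaf-02", used_entries=2100, capacity=4096),
    ]
    for severity, switch, util in mft_alerts(samples):
        print(f"{severity}: {switch} multicast table at {util:.0%}")
```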
What are we doing to prevent recurrence?
As part of the measures established following the 03.03.2026 incident, we had already rolled out configuration changes to one of the two topologies in the affected cluster. These changes are intended to significantly reduce the load on the control plane by optimizing resweep and failover times. Due to existing instabilities in the second topology, they had not yet been applied there.
In response to the degradations observed on 11.03.2026, we accelerated the rollout of these optimizations yesterday via emergency maintenance across all TXL topologies and several other data centers. Post-implementation, we observed a clear positive impact, with OpenSM load substantially reduced. Furthermore, we reconfigured the management services to reduce the number of multicast forwarding entries, which lowered the base load of the control-plane process considerably.
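The specific parameters adjusted during the maintenance are not listed here. Purely as an illustration, the excerpt below shows the kind of sweep- and failover-related settings OpenSM exposes in opensm.conf; the values are placeholders, not the configuration that was deployed.

```
# opensm.conf (illustrative excerpt, placeholder values - not the deployed change set)

# Seconds between periodic subnet sweeps; longer intervals reduce steady-state
# control-plane load.
sweep_interval 30

# Trigger a resweep when a trap is received instead of relying only on polling.
sweep_on_trap TRUE

# Timeout (ms) and retry count used by a standby SM when polling the master;
# these govern how quickly a failover is detected.
sminfo_polling_timeout 10000
polling_retry_number 4

# Relative priority of this SM instance during master election / failover.
sm_priority 15
```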
Immediate Technical Actions:
Long-term Structural Improvements:
With these measures now in place, we are confident the immediate cause of the cluster instability has been mitigated. Our long-term strategy will further enhance network stability and scalability across all data centers while addressing the identified bottlenecks.
We recognize that these disruptions have impacted our customers and partners, and we sincerely appreciate your patience regarding the short-notice emergency maintenance announced yesterday. We are continuing to monitor the cluster closely to ensure all deployed fixes remain effective.