We have completed the Root Cause Analysis for this incident.
What happened
On 2026-01-13, our Cluster 601 in TXL (Berlin) experienced a network instability event. During this time, servers in the cluster saw frequent flapping of their BGP sessions with the gateway/core routers, causing periods of connectivity loss lasting between 5 and 30 seconds for each affected server during the flaps.
How could this happen
The incident was triggered during a planned maintenance activity on one of the cluster's network topologies (topo 1). We expected the maintenance to be transparent to customers.
Root Cause: The maintenance activity led to an unexpected CPU spike in the Fabric Manager process (the control plane of the fabric) on the second, healthy network topology (topo 2). This spike starved the BGP routing process on the gateway of CPU resources, causing the BGP sessions to fail and flap.
Cross-Topology Impact: The maintenance activity on one side of our system unexpectedly spilled over and impacted the second, healthy side. This happened because the core operating system and network software did not treat the two network segments as fully independent. When the first segment was shut down, the system triggered a global "cleanup" process that briefly overloaded the healthy segment's Fabric Manager, which in turn led to the routing failures (BGP flaps).
In short, a procedure intended for one part of the network had an unforeseen side effect on the other, healthy part due to shared dependencies in the underlying software.
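To illustrate the mechanism described above, here is a minimal, purely illustrative sketch (not our production tooling; the timer values are hypothetical, though 90-second hold times with 30-second keepalives are common defaults): a BGP session drops when the router fails to process a peer's keepalive within the negotiated hold time, which is exactly what happens when the routing process is starved of CPU.

```python
def session_flaps(keepalive_times, hold_time):
    """Return True if any gap between processed keepalives exceeds hold_time,
    i.e. the BGP hold timer would expire and the session would be torn down."""
    last = keepalive_times[0]
    for t in keepalive_times[1:]:
        if t - last > hold_time:
            return True
        last = t
    return False

# Normal operation: keepalives processed every 30 s -> session stays up.
normal = [0, 30, 60, 90, 120]

# Starved control plane: the routing process is scheduled late, so no
# keepalive is processed for 100 s -> the 90 s hold timer expires.
starved = [0, 30, 130, 160]

print(session_flaps(normal, hold_time=90))   # False
print(session_flaps(starved, hold_time=90))  # True
```

The point of the sketch is that the session does not fail because keepalives stop arriving on the wire, but because the starved routing process cannot process them in time, so the hold timer expires and the session flaps once CPU becomes available again.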
What are we doing to prevent this from happening again
We have identified several long-term remediation actions to prevent this specific issue from recurring: