Today we are releasing a preliminary Root Cause Analysis that represents the current state of the investigation into the network incident. While we have high confidence in the technical details of the network incident itself, we will extend this analysis as new information becomes available about the services affected by its spillover effects. We expect to publish the complete RCA by the end of this week.
UPDATE 13.03.2026 - We are updating the published preliminary RCA to enrich the findings and increase transparency regarding other affected services.
During a scheduled maintenance window to replace a faulty switch in a cluster situated in the TXL data center, a network outage occurred.
The incident triggered a spillover effect, resulting in a large backlog of unprocessed provisioning jobs from the affected cluster. Following network recovery, this backlog caused significant resource-locking delays, slowing the processing of queued jobs. This impacted systems and services that relied on self-healing automations triggered by the primary incident or that were undergoing changes at the time.
UPDATE: In the following, we explain in more detail the impact on other affected services:
IAM: For customers, the DCD frontend was unavailable during the incident because the underlying IAM service was unreachable from the frontend. During the incident, an IP assignment maintenance job had been triggered that remained pending for an extended period, rendering the IAM service unavailable to the frontend.
AI Model Hub: The AI Model Hub service was directly affected by the network outage. While the service itself remained functional throughout the incident, the loss of network connectivity on the underlying Managed Kubernetes cluster made it unavailable to customers.
DBaaS/Managed Kubernetes: DBaaS and Managed Kubernetes were impacted through two separate mechanisms. First, servers hosting DBaaS and Kubernetes workloads lost BGP connectivity as a direct result of the network outage, causing immediate service disruption. Second, the resulting delays in provisioning-queue processing deferred key operations, such as volume attach/detach activities and automatically triggered self-healing mechanisms, leading to further delays during the recovery phase once network connectivity was restored.
Provisioning: Provisioning was first directly impacted by the loss of connectivity to resources on the affected cluster, which caused an initial spike in queued jobs. As services affected by the connectivity issues entered self-healing, additional jobs were placed in the queue. This led to an exponential increase in job numbers, which, after the network incident was resolved, caused resource-locking bottlenecks that reduced the normal processing speed of the queue. This hampered automated recovery mechanisms, leading to extended service degradation, especially for IAM and Managed Kubernetes.
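To illustrate this dynamic, the following simplified sketch (a hypothetical model, not our actual provisioning engine; all job names, rates, and thresholds are illustrative) shows how self-healing retries can compound a backlog during an outage, and how a large backlog then slows its own drain after recovery:

```python
# Hypothetical, simplified model of a provisioning queue under a network outage.
# Names, rates, and thresholds are illustrative only.
from collections import deque

HEALING_JOBS_PER_FAILURE = 2   # each failed job triggers follow-up self-healing jobs
NETWORK_DOWN_TICKS = 5         # ticks during which no job can succeed
JOBS_PER_TICK = 10             # normal queue throughput per tick
LOCK_PENALTY = 0.5             # fraction of throughput lost to resource locking

queue = deque(f"job-{i}" for i in range(50))  # jobs pending when the outage begins
backlog_history = []

for tick in range(20):
    network_up = tick >= NETWORK_DOWN_TICKS
    # Resource locking: a large backlog slows processing even after recovery.
    throughput = JOBS_PER_TICK if len(queue) < 100 else int(JOBS_PER_TICK * LOCK_PENALTY)
    for _ in range(min(throughput, len(queue))):
        job = queue.popleft()
        if not network_up:
            # The job fails; self-healing enqueues additional remediation jobs,
            # so the backlog grows faster than it drains.
            queue.append(job)
            queue.extend(f"heal({job})-{k}" for k in range(HEALING_JOBS_PER_FAILURE))
    backlog_history.append(len(queue))

print(backlog_history)  # backlog compounds during the outage, drains slowly after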
The network maintenance took place in Topology 2 of 2 in the affected cluster. During the maintenance, Topology 2 of 2 was disabled as planned. At this point, the Multicast Forwarding Table (MFT) capacity across all switches in Topology 1 was already strained by saturation from IPoIB multicast groups, leaving Topology 1 operating in a vulnerable state.
Topology 1 of 2 then suffered a critical gateway overload, initiated by a link flap on a switch that flooded the InfiniBand control plane with alerts, triggering a fabric resweep - an automated self-healing mechanism that resets the network map. Because the MFT was already at capacity, the reprogramming required for the reset failed. The resulting surge of unsuccessful route updates overwhelmed the management layer, causing BGP sessions to drop and ultimately severing connectivity for hosts in the affected cluster.
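As a conceptual illustration of why the resweep failed, consider the following simplified sketch (hypothetical; the capacity, group names, and logic are illustrative and do not reflect actual switch firmware). Once the MFT has no free entries, every additional route update fails and surfaces as a further alert to the already stressed control plane:

```python
# Hypothetical, illustrative model: a fabric resweep reprogramming multicast
# routes against an MFT that is already saturated. Capacity is made up.
MFT_CAPACITY = 8  # illustrative per-switch limit on multicast forwarding entries

def resweep(existing_groups, groups_to_program, capacity=MFT_CAPACITY):
    """Attempt to reprogram multicast routes after a topology reset.

    Returns the groups whose route updates failed for lack of free MFT
    entries; each failure surfaces as an alert to the control plane.
    """
    table = set(existing_groups)
    failed = []
    for group in groups_to_program:
        if group in table:
            continue  # route already programmed
        if len(table) >= capacity:
            failed.append(group)  # no free entry: route update fails
        else:
            table.add(group)
    return failed

# Topology 1 is already saturated with IPoIB multicast groups...
existing = [f"mgid-{i}" for i in range(MFT_CAPACITY)]
# ...so reprogramming the groups displaced from the disabled topology fails,
# and the surge of failed updates overwhelms the management layer.
failures = resweep(existing, [f"mgid-{i}" for i in range(MFT_CAPACITY, MFT_CAPACITY + 6)])
print(f"{len(failures)} route updates failed")  # -> 6 route updates failed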
As the connectivity loss made it impossible for provisioning jobs to succeed in the affected cluster, all changes scheduled for resources on the affected hosts were queued.
The network incident was resolved by a switchover from the affected Control Plane to its standby, a reboot of the locked-up gateways, and recovery of Topology 1 from the scheduled maintenance.
Although the network incident was resolved, the resulting volume of provisioning jobs triggered significant resource locking. This bottleneck delayed critical updates, leaving several (self-healing) services in a degraded state while they waited for their queued jobs to be processed.
Today, we are highlighting the measures - both planned and currently deployed - designed to reduce the likelihood of a similar incident. We are listing these measures by affected service, but want to underline that the “interconnectedness” of the services through the provisioning queue was identified as a key contributor to the duration of the service degradation. While individual measures will make the services individually more robust, we are also implementing architectural changes to the provisioning queue to ensure better de-coupling and reduce the risk of spillovers as observed here.
In an RCA released for a related incident on 11.03.2026, triggered by the same technical root cause, we published immediate technical actions and long-term structural improvements. For consistency, we are sharing these measures and their implementation status here as well.
Immediate Technical Actions:
Long-term Structural Improvements:
With these measures now in place, we are confident that the immediate cause of the cluster instability has been mitigated. Our long-term strategy will further enhance network stability and scalability across all data centers, addressing the identified bottlenecks.
Immediate Technical Actions: IAM Service Resilience during Maintenance: We have hardened IAM maintenance protocols to ensure critical identity services remain functional even if primary Cloud APIs are temporarily unavailable. (DONE)
Long-term Structural Improvements: Hardening Core Identity Services (IAM): We are expediting the ongoing migration of our Identity and Access Management (IAM) system to a dedicated platform. This removes IAM's dependency on the standard managed Kubernetes clusters (Q3 2026).
As the AI Model Hub was affected by connectivity loss in the underlying Managed Kubernetes setup, the service will directly benefit from the measures planned to increase Managed Kubernetes resilience.
Improvements to Management and Provisioning Handling: Structural improvements are planned that make cluster setup and management more seamless, reliable, and automated. This will minimize the surface for service disruptions during updates and changes, like those triggered by self-healing mechanisms. Key improvements include better handling of node maintenance, more predictable scaling, and enhanced stability for production workloads. These changes are part of a broader effort to improve the Kubernetes experience, but will also help address the specific issues relevant in this incident by introducing a dedicated provisioning provider. (Q3 2026)
Enhanced Provisioning Resilience: To prevent "logjams" during maintenance or incidents, we are introducing automated circuit breakers within our provisioning engine to reduce the risk of overloads and perpetual queue buildup caused by self-healing services. Several quality-of-service improvements are planned to improve visibility and control over job execution and prioritization within the provisioning queue. Additional steps toward de-coupling the provisioning queues will be taken in an upcoming provisioning maintenance to test provisioning switchover from one DC to another. (Q1 2026)
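As a conceptual sketch of the circuit-breaker idea (hypothetical; this is not our provisioning engine's implementation, and all thresholds and names are illustrative): once failures exceed a threshold, the breaker opens and new self-healing jobs are shed instead of being queued, with a periodic probe to detect recovery:

```python
# Hypothetical circuit-breaker sketch; thresholds and names are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (jobs flow)

    def allow(self):
        """Return True if a new self-healing job may be enqueued."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_timeout_s:
            # Half-open: let one probe job through to test recovery.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return True
        return False  # open: shed load instead of growing the queue

    def record(self, success):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker

breaker = CircuitBreaker()
for outcome in [False] * 6:   # a burst of failing jobs during an outage...
    if breaker.allow():
        breaker.record(outcome)
print(breaker.allow())  # -> False: further self-healing jobs are shed for now
```

In this model, a sustained burst of failures stops feeding the queue rather than compounding it, which is the behavior we aim for during incidents like the one described above.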
We hope this update to our preliminary RCA increases transparency regarding how the network incident in the affected cluster unfolded and how it influenced other services. We have derived a series of improvements from this event that will help make our services more resilient in the future. While our ongoing network modernization initiative will provide greater performance and stability, we also aim to reduce the 'blast radius' of future incidents by identifying and addressing dependencies within our services. We recognize that this incident has affected customers and partners in various ways. It is important to us to provide a comprehensive, transparent account of the disruption, as well as the initiatives we have implemented - and will continue to put into place - to help avoid similar issues moving forward.