Preliminary Root Cause Analysis
Today we are releasing a preliminary Root Cause Analysis reflecting the current state of our investigation into the network incident. While we have high confidence in the technical details of the network incident itself, we will extend this analysis as new information becomes available about the services affected by its spillover effects. We expect to publish the complete RCA by the end of this week.
What happened?
During a scheduled maintenance window to replace a faulty switch in a cluster situated in the TXL data center, a network outage occurred.
The incident triggered a spillover effect: a large queue of unprocessed provisioning jobs built up for the affected cluster. Following network recovery, this backlog caused significant resource-locking delays, slowing the processing of the queued jobs. This impacted systems and services that relied on self-healing automations triggered by the primary incident, or that were undergoing changes at the time.
How was this possible? (Technical Root Cause)
The maintenance took place in Topology 2 of 2 of the affected cluster, and Topology 2 was disabled as planned. At this point, the Multicast Forwarding Table (MFT) capacity across all switches in Topology 1 was already strained by saturation from IPoIB multicast groups, leaving Topology 1 operating in a vulnerable state.
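To make that pre-incident state concrete, here is a minimal sketch of an MFT headroom check. The capacity limit, group count, and function name are hypothetical illustrations and do not reflect the actual switch firmware or fabric values:

```python
# Minimal sketch (hypothetical numbers and names): why taking one of two
# topologies offline is risky when the surviving topology's Multicast
# Forwarding Table (MFT) is already near capacity.

MFT_CAPACITY = 4096          # hypothetical MFT entry limit per switch

def mft_headroom(programmed_groups: int, capacity: int = MFT_CAPACITY) -> int:
    """Entries still free before multicast route programming starts to fail."""
    return capacity - programmed_groups

# IPoIB multicast groups already programmed on Topology 1 switches
# (hypothetical figure illustrating saturation):
groups_on_topology_1 = 4090

print(mft_headroom(groups_on_topology_1))  # 6 -> effectively no room for the
                                           # reprogramming a failover requires
```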
Topology 1 of 2 then suffered a critical gateway overload, initiated by a link flap on a switch that flooded the InfiniBand control plane with alerts and triggered a fabric resweep, an automated self-healing mechanism that rebuilds the network map. Because the MFT was already at capacity, the route reprogramming required for this reset failed. The resulting surge of unsuccessful route updates overwhelmed the management layer, causing BGP sessions to drop and ultimately severing connectivity for hosts in the affected cluster.
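The failure cascade can be illustrated with a small simulation. Everything below (the class and function names, the per-switch capacity, the route counts) is a hypothetical sketch of the mechanism described above, not the actual fabric management code:

```python
MFT_CAPACITY = 4096          # hypothetical per-switch MFT entry limit

class Switch:
    def __init__(self, programmed_groups: int):
        self.programmed_groups = programmed_groups

    def program_route(self) -> bool:
        """Install one multicast route; fails once the MFT is full."""
        if self.programmed_groups >= MFT_CAPACITY:
            return False
        self.programmed_groups += 1
        return True

def fabric_resweep(switches: list, routes: int) -> int:
    """Self-healing resweep: rebuild the network map by reprogramming routes.

    Returns the number of failed route updates; in the incident, a surge of
    such failures is what overwhelmed the management layer.
    """
    failures = 0
    for _ in range(routes):
        for sw in switches:
            if not sw.program_route():
                failures += 1
    return failures

# Topology 1 switches were already saturated when the link flap triggered
# the resweep:
switches = [Switch(programmed_groups=MFT_CAPACITY) for _ in range(8)]
print(fabric_resweep(switches, routes=500))  # 4000 failed updates: the surge
                                             # that dropped the BGP sessions
```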
As the connectivity loss made it impossible for provisioning jobs to succeed in the affected cluster, all changes scheduled for resources on the affected hosts were queued.
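As a rough sketch, a provisioning scheduler behaving this way might queue changes for unreachable hosts as follows; the scheduler, connectivity check, and job names are all hypothetical:

```python
# Minimal sketch (hypothetical scheduler) of how provisioning jobs for
# unreachable hosts are queued rather than executed, building the backlog.

from collections import deque

backlog: deque = deque()

def host_reachable(host: str) -> bool:
    """Placeholder connectivity check; always False during the outage."""
    return False

def apply_change(host: str, change: str) -> None:
    print(f"applying {change} on {host}")

def schedule_change(host: str, change: str) -> None:
    if host_reachable(host):
        apply_change(host, change)        # normal path
    else:
        backlog.append((host, change))    # outage path: queue for later

for i in range(3):
    schedule_change(f"host-{i}", "resize-volume")

print(len(backlog))  # 3: every scheduled change waits for connectivity
```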
The network incident was resolved by a switchover from the affected Control Plane to its standby, a reboot of the locked-up gateways, and recovery of Topology 1 from the scheduled maintenance.
Although the network incident was resolved, the resulting volume of provisioning jobs triggered significant resource locking. This bottleneck delayed critical updates, leaving several (self-healing) services in a degraded state while they waited for their queued jobs to be processed.
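The effect of resource locking on a large backlog can be sketched as follows, assuming a hypothetical per-resource lock and hypothetical job timings: when many queued jobs contend for the same lock, they drain strictly one at a time.

```python
import threading
import time

# One lock per contended resource; a single hot resource is enough to
# serialize the whole backlog (names and timings are hypothetical).
resource_locks = {"cluster-network-config": threading.Lock()}

def process_job(job_id: int, resource: str) -> None:
    # Every queued job touching this resource waits its turn on the lock,
    # including the updates that self-healing services depend on.
    with resource_locks[resource]:
        time.sleep(0.01)  # hypothetical per-job work while holding the lock

threads = [
    threading.Thread(target=process_job, args=(i, "cluster-network-config"))
    for i in range(200)  # the post-recovery backlog arriving all at once
]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"backlog drained in {time.time() - start:.1f}s")  # ~2s, fully serialized
```

Jobs arriving gradually would amortize this serialization; a recovery-sized backlog arriving all at once does not, which is why the affected services remained degraded until the queue drained.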
What are we doing to prevent recurrence?
While we are still compiling the complete list of action items for the affected services, we can already share the following items:
We remain committed to transparency, and we thank you for your patience as we continue our investigation with due diligence.