Network Service Degradation in TXL

Incident Report for IONOS Cloud

Postmortem

Preliminary Root Cause Analysis

Today we are releasing a preliminary Root Cause Analysis that reflects the current state of our investigation into the network incident. While we have high confidence in the technical details of the network incident itself, we will extend this analysis as new information becomes available about the services affected by the spillover effects of the main incident. We expect to publish the complete RCA by the end of this week.

What happened?

A network outage occurred during a scheduled maintenance window to replace a faulty switch in a cluster in the TXL data center.

The incident triggered a spillover effect, leaving a large queue of unprocessed provisioning jobs from the affected cluster. Following network recovery, this backlog caused extensive resource locking, which slowed the processing of queued jobs. This impacted systems and services that relied on self-healing automations triggered by the primary incident or that were undergoing changes at the time.

How was this possible? (Technical Root Cause)

The maintenance took place in Topology 2 of 2 in the affected cluster, and Topology 2 was disabled as planned. At that point, the Multicast Forwarding Table (MFT) capacity across all switches in Topology 1 was already strained by saturation from IPoIB multicast groups, leaving Topology 1 operating in a vulnerable state.

Topology 1 of 2 then suffered a critical gateway overload, initiated by a link flap on a switch that flooded the InfiniBand control plane with alerts and triggered a fabric resweep, an automated self-healing mechanism that resets the network map. Because the MFT was already at capacity, the reprogramming required for the reset failed. The resulting surge of unsuccessful route updates overwhelmed the management layer, causing BGP sessions to drop and ultimately severing connectivity for hosts in the affected cluster.
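To make the failure mode concrete, here is a minimal sketch, not IONOS tooling, of why a resweep can fail once the MFT is near capacity. The capacity limit, group count, and transient overhead below are illustrative assumptions, not actual fabric parameters.

```python
# Illustrative sketch only (not IONOS tooling): a fabric resweep must
# reprogram every multicast route, which needs headroom in the MFT.
# All numbers here are hypothetical.

MFT_CAPACITY = 4096        # assumed per-switch MFT entry limit
IPOIB_GROUPS = 4000        # IPoIB multicast groups already programmed

def resweep_succeeds(groups: int, capacity: int) -> bool:
    """During reprogramming, stale entries briefly coexist with new
    ones, so a resweep needs more table space than steady state."""
    transient_overhead = groups // 10   # assumed ~10% duplicate entries
    return groups + transient_overhead <= capacity

if resweep_succeeds(IPOIB_GROUPS, MFT_CAPACITY):
    print("Resweep completes: network map rebuilt")
else:
    # Failed route updates raise further alerts, which can retrigger
    # the resweep (the overload loop described above).
    print("MFT exhausted: route updates fail and alerts cascade")
```

The key point is that a resweep needs headroom the saturated table no longer had, so the self-healing mechanism itself became the amplifier.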

As the connectivity loss made it impossible for provisioning jobs to succeed in the affected cluster, all changes scheduled for resources on the affected hosts were queued.

The network incident was resolved by a switchover from the affected Control Plane to its standby, a reboot of the locked-up gateways, and recovery of Topology 1 from the scheduled maintenance.

Although the network incident was resolved, the accumulated volume of provisioning jobs triggered significant resource locking. This bottleneck delayed critical updates, leaving several (self-healing) services in a degraded state while they waited for their queued jobs to be processed.
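The locking dynamic can be illustrated with a minimal sketch; this is not the IONOS provisioning engine, and the job counts, resource names, and durations are made-up assumptions:

```python
# Illustrative sketch (not the IONOS provisioning engine): jobs that
# touch the same resource must hold its lock exclusively, so they
# serialize even when workers are otherwise free.

from collections import defaultdict

JOB_DURATION = 2  # assumed seconds to process one provisioning job

def drain_backlog(jobs: list[str]) -> dict[str, int]:
    """Each job locks its target resource for JOB_DURATION seconds;
    the last job on a contended resource waits for all earlier ones."""
    lock_free_at: dict[str, int] = defaultdict(int)
    finish_times: dict[str, int] = {}
    for i, resource in enumerate(jobs):
        start = lock_free_at[resource]          # wait for the lock
        lock_free_at[resource] = start + JOB_DURATION
        finish_times[f"job{i}({resource})"] = start + JOB_DURATION
    return finish_times

# Six queued jobs, four of them contending for the same server's lock:
backlog = ["server-a", "server-a", "server-b", "server-a", "lan-1", "server-a"]
for job, t in drain_backlog(backlog).items():
    print(f"{job} finishes at t={t}s")
```

Jobs touching uncontended resources drain quickly, while jobs stacked behind the same lock finish late, which is why some services recovered promptly and others stayed degraded until the backlog cleared.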

What are we doing to prevent recurrence?

While we are still compiling all action items for the affected services, we can already share the following:

  • Network Infrastructure Isolation: We are moving our control plane to dedicated, high-performance servers. We started this work in Q4 2025 and will continue it to ensure better decoupling of this service. By separating network management from data-plane traffic, we eliminate resource competition, ensuring that network routing remains stable and responsive even under heavy data loads. Additionally, we are performing a deep-dive audit of specific IPv6 configurations and IPoIB driver settings to further harden the network against CPU spikes and MFT capacity exhaustion. (Ongoing, Q3 2026)
  • Network Modernization: Our cloud infrastructure relies on a high-performance interconnect fabric that is undergoing a strategic modernization program covering switches, gateways, and drivers to significantly increase both resiliency and performance. We are executing the transition in phases to mitigate risk. (Ongoing, Q3 2026)
  • Enhanced Provisioning Resilience: To prevent "logjams" during maintenance or incidents, we are introducing automated circuit breakers within our provisioning engine to reduce the risk of overloads and perpetual queue build-up caused by self-healing services; the general pattern is sketched after this list. Several quality-of-service improvements are planned to improve visibility and control over job execution and prioritization within the provisioning queue. (ETA to come)
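For readers unfamiliar with the circuit-breaker pattern, here is a minimal, generic sketch; it is not our provisioning engine's implementation, and the failure threshold, cooldown, and dispatch stub are hypothetical:

```python
# Generic circuit-breaker pattern (illustrative only, not the actual
# provisioning engine). Thresholds and cooldown are hypothetical.

import time

class CircuitBreaker:
    """Stops dispatching jobs after repeated failures, so a failing
    backend collects a bounded queue instead of a perpetual one."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                        # closed: traffic flows
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None              # half-open: probe again
            self.failures = 0
            return True
        return False                           # open: shed load

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

def dispatch(job: int) -> bool:
    """Stand-in for sending a job to a backend; always fails here to
    demonstrate the breaker opening."""
    return False

breaker = CircuitBreaker(failure_threshold=3, cooldown_s=30.0)
for job in range(6):
    if breaker.allow():
        breaker.record(dispatch(job))
    else:
        print(f"job {job} held back: breaker open")
```

The idea is that once a backend fails repeatedly, the breaker opens and sheds load for a cooldown period instead of letting retries pile into a perpetual queue, then probes again before resuming normal dispatch.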

We remain committed to transparency, and we thank you for your patience as we continue our investigation with due diligence.

Posted Mar 09, 2026 - 18:10 UTC

Resolved

The Provisioning and DBaaS services have been restored to full operational status. The backlog has been cleared and no further service degradation is expected.
Our Network Team is currently preparing a comprehensive Root Cause Analysis (RCA) of this incident, which we will publish on this page. We anticipate releasing the complete analysis by tomorrow.
Thank you for your patience.
Posted Mar 03, 2026 - 21:49 UTC

Monitoring

We are setting the DCD frontend back to 'Operational' now that the dependency on the provisioning side has been resolved. Customers should now be able to use the DCD via the web frontend again.

We are setting the incident to 'Monitoring' status and will ensure the successful recovery of the still-affected services. Customers might still experience a performance impact on provisioning-related activities, such as modifying infrastructure resources in the DCD.

We recommend postponing changes that are not mission-critical or urgent until the incident is marked as "Resolved".
Posted Mar 03, 2026 - 15:25 UTC

Update

We are still waiting for the provisioning backlog to be processed. Customers might face extended provisioning times. DCD web-frontend availability and DBaaS services remain affected due to their dependency on provisioning assignments. We will continue to update the status page, though likely at a reduced cadence as we monitor backlog consumption.

In parallel, a Root Cause Analysis (RCA) has been initiated for the triggering incident. We will share the RCA here as soon as it becomes available.
Posted Mar 03, 2026 - 15:16 UTC

Update

We are setting AI Model Hub back to "Operational".
Posted Mar 03, 2026 - 14:37 UTC

Update

We are currently working on re-establishing connectivity for the DCD's supporting systems in order to restore that service as well.
Posted Mar 03, 2026 - 14:33 UTC

Update

We see that the AI Model Hub service is recovering. We are closely monitoring request processing and reducing the severity of the impact for this service.
Posted Mar 03, 2026 - 14:20 UTC

Update

Network connectivity is improving. We will re-enable provisioning and monitor job execution as well as the rest of the affected services.
Posted Mar 03, 2026 - 14:01 UTC

Update

Our network team is still diagnosing ongoing connectivity issues that affect the referenced services.
Posted Mar 03, 2026 - 13:39 UTC

Update

We are upgrading the impact level for the AI Model Hub.
Posted Mar 03, 2026 - 13:24 UTC

Update

We are temporarily pausing provisioning in a cluster currently experiencing a backlog of queued jobs. While running services remain unaffected, updates to existing entities will not be processed until provisioning is reactivated.
Posted Mar 03, 2026 - 12:54 UTC

Update

We have added DBaaS to the list of affected services due to an increased error count. The team is aware and actively working on the service.
Posted Mar 03, 2026 - 12:45 UTC

Identified

We are seeing issues again on the DCD frontend, as well as the AI Model Hub. Customers may still see connectivity issues with these services. Our Services Team is investigating.
Posted Mar 03, 2026 - 12:38 UTC

Monitoring

We are seeing services recovering. We will monitor the progress over the next few minutes and update the status of the affected services.
Posted Mar 03, 2026 - 12:09 UTC

Update

We have added DCD and AI Model Hub to the list of affected services. Customers might currently experience connectivity issues with these services.
Posted Mar 03, 2026 - 12:01 UTC

Identified

We have identified an issue related to the ongoing maintenance: https://status.ionos.cloud/incidents/zg4mpk9x724t
Our network team is currently working on restoring service to the affected components. Customers might experience intermittent network service degradation or outages.
We are upgrading the severity of the incident and will keep you informed on the progress of the recovery.
Posted Mar 03, 2026 - 11:53 UTC

Investigating

We are currently investigating monitoring alerts in TXL. We will keep you updated on our investigation.
Posted Mar 03, 2026 - 11:36 UTC
This incident affected: APIs and Frontends (Data Center Designer (DCD)), Global Services (Database as a Service (DBaaS), AI Model Hub), and Location DE/TXL (Network, Provisioning).