Limited access to provisioning services, DCD, Managed Kubernetes

Incident Report for IONOS Cloud

Postmortem

In the following Root Cause Analysis, we explain the provisioning degradation that affected IONOS Cloud customers between May 3 and May 6, 2026, identify the technical causes, and outline the measures we are taking to prevent a recurrence.

What happened?

Between May 3 04:07 UTC and May 6 approximately 15:00 UTC, the IONOS Cloud provisioning system experienced repeated degradation across all locations, making infrastructure provisioning requests slow or unavailable for approximately 36 hours. Managed Kubernetes (MKS) clusters were placed in read-only maintenance mode for several hours on May 4, preventing cluster operations during that period. This led to customer-facing service degradation and in some cases to availability disruptions due to delayed MKS operations.

The incident involved two distinct but related failure modes that overlapped on May 4:

Provisioning queue congestion: The provisioning queue grew under concentrated load until a feedback loop in the queue processor became self-sustaining, stalling infrastructure change operations.
API pod restarts: From mid-morning UTC on May 4, several Cloud API pods entered a continuous restart cycle. Database connection pools and request-handling thread pools were exhausted, causing the pods' health checks to time out repeatedly. The pods recovered automatically once the provisioning queue load dropped in the early afternoon.

To understand the causes, it is helpful to provide context on how the IONOS Cloud provisioning system processes infrastructure changes.

All infrastructure operations - whether initiated through the DCD, the Cloud API, or automated platform maintenance - are submitted to a provisioning queue. The queue processes these operations in order, acquiring a lock on the relevant infrastructure components to ensure consistency. This serialization is necessary to maintain the integrity of customer infrastructure across all locations. A dedicated queue processor is responsible for working through queued items. Under normal operating conditions, this architecture performs reliably. However, it contains a structural weakness that this incident exposed: the cost of processing each incoming queue signal grows with the current queue depth. Under sustained high load, this creates a feedback loop - the deeper the queue, the slower the processor, and the slower the processor, the deeper the queue grows.
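The feedback loop described above can be illustrated with a toy model. This is a minimal sketch, not the actual queue processor: the function name `simulate` and every number in it (time budget, base cost, depth-dependent cost, arrival rate) are assumptions chosen only to make the threshold visible.

```python
def simulate(arrivals_per_tick, burst, ticks=200,
             budget=100.0, base_cost=1.0, depth_cost=0.01):
    """Return the queue depth after each tick of a toy queue processor.

    All parameters are illustrative assumptions, not measured values.
    The cost of handling one item grows with the current queue depth,
    so a deeper queue means fewer items handled per tick.
    """
    depth = burst  # one-off backlog, e.g. a maintenance batch
    history = []
    for _ in range(ticks):
        depth += arrivals_per_tick                      # baseline inflow
        cost_per_item = base_cost + depth_cost * depth  # grows with backlog
        capacity = int(budget / cost_per_item)          # items handled this tick
        depth -= min(depth, capacity)
        history.append(depth)
    return history


# Below the threshold the backlog drains; above it, the backlog keeps growing.
small = simulate(arrivals_per_tick=40, burst=100)
large = simulate(arrivals_per_tick=40, burst=200)
print("burst 100 drains to:", small[-1])
print("burst 200 grows past its starting backlog:", large[-1] > large[0])
```

In this model the baseline inflow is identical in both runs. Only the size of the initial burst decides whether the processor recovers or falls permanently behind.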

How was this possible?

IONOS Cloud's Managed Kubernetes service performs regular automated maintenance on customer clusters, including snapshot cleanup operations. These snapshot operations are handled through a platform-wide processing lane, as snapshots are not bound to individual customer virtual datacenters at the storage layer. This means that snapshot work from all MKS customers - regardless of which contract or cluster it originates from - is serialized through the same queue lane.

In the early hours of May 3, a large number of customer contracts ran snapshot cleanup jobs in the same overnight window, each contributing a substantial batch of delete operations to this shared processing lane. The combined workload was sufficient to push the queue processor past the threshold at which the feedback loop becomes self-sustaining: the queue could no longer process the baseline load of provisioning jobs fast enough to reduce the queue depth, creating a buildup of provisioning jobs, which in turn added further processing load.

As the queue depth grew, the processor began to fall behind. Managed Kubernetes controllers, observing slow API responses, legitimately retried their requests - amplifying the inflow severalfold at peak load.
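One common way to keep legitimate retries from multiplying queue work is to collapse a retry onto the operation that is already in flight. The sketch below assumes each request carries a client-chosen idempotency key. The class `InflightDeduplicator` and its methods are hypothetical and are not part of the IONOS Cloud API.

```python
class InflightDeduplicator:
    """Collapse retries of an in-flight operation onto the original.

    Hypothetical illustration: assumes every request carries an
    idempotency key that stays the same across retries.
    """

    def __init__(self):
        self._inflight = {}   # idempotency key -> operation id
        self._next_id = 0
        self.enqueued = []    # stands in for the provisioning queue

    def submit(self, key, payload):
        """Return (operation_id, is_new). Retries reuse the existing id."""
        if key in self._inflight:
            return self._inflight[key], False
        op_id = self._next_id
        self._next_id += 1
        self._inflight[key] = op_id
        self.enqueued.append((op_id, payload))
        return op_id, True

    def complete(self, key):
        """Forget the key once the operation has finished."""
        self._inflight.pop(key, None)


dedup = InflightDeduplicator()
for _attempt in range(3):  # one request plus two retries
    dedup.submit("delete-snapshot-123", {"action": "delete"})
print(len(dedup.enqueued))  # 1
```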

The queue feedback loop

The queue message handler performs a lookup of pending items per datacenter each time a processing signal is received. At the queue depth reached during the incident, this operation was orders of magnitude more expensive than under normal conditions. Concurrently, certain internal completion signals bypassed the deduplication logic that normally moderates message inflow, adding load to the processor's message queue directly. Both factors compound: the cost of each handling step grows with the backlog, and the rate of incoming signals also grows with the backlog. Past a threshold, the processor can no longer drain its workload faster than new signals arrive.
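The report does not describe how the per-datacenter lookup is implemented. As an illustration of why such a lookup can become expensive, the sketch below contrasts a scan over the whole backlog with an index keyed by datacenter. All names (`next_for_scan`, `PendingIndex`) are hypothetical.

```python
from collections import defaultdict, deque


def next_for_scan(queue, datacenter_id):
    """Assumed naive lookup: walks the whole backlog, so cost grows with depth."""
    for i, (dc, item) in enumerate(queue):
        if dc == datacenter_id:
            del queue[i]
            return item
    return None


class PendingIndex:
    """Pending items grouped per datacenter; lookup cost is independent of
    the total backlog size."""

    def __init__(self):
        self._by_dc = defaultdict(deque)

    def add(self, datacenter_id, item):
        self._by_dc[datacenter_id].append(item)

    def next_for(self, datacenter_id):
        pending = self._by_dc.get(datacenter_id)
        if not pending:
            return None
        item = pending.popleft()
        if not pending:
            del self._by_dc[datacenter_id]  # keep the index compact
        return item
```

Both variants return items for a datacenter in submission order. The difference is only in how the cost of each lookup behaves as the backlog grows.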

The first phase on May 3 partially self-resolved when the MKS maintenance batch completed, reducing inflow and allowing the feedback loop to unwind naturally. The second phase on May 4 did not self-resolve, because MKS cluster auto-healing continued generating new snapshot and storage operations against the same already-saturated queue lane.

API pod resource exhaustion

From mid-morning UTC on May 4, Cloud API pods began failing their health checks and entering a restart cycle. The sustained queue backpressure caused API request-handling threads and database connections to be held open for extended periods while waiting for the queue to accept and process their operations. This depleted the connection pool and thread pool resources available to the pods, resulting in failed health checks.
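The mechanism can be sketched with a small bounded pool. This is an illustration only and does not reflect the actual API pod implementation, whose contributing factors are still under analysis. The names `BoundedPool` and `PoolExhausted`, the pool size, and the timeout are assumptions. The sketch shows that once every slot is held by a request waiting on a slow downstream, the next caller cannot be served. A bounded acquisition timeout at least makes that caller fail fast instead of queuing indefinitely.

```python
import threading
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no slot becomes free within the timeout."""


class BoundedPool:
    """Fixed number of slots (threads or connections) with bounded waiting."""

    def __init__(self, size, acquire_timeout):
        self._slots = threading.BoundedSemaphore(size)
        self._timeout = acquire_timeout

    @contextmanager
    def slot(self):
        # Wait at most `acquire_timeout` seconds for a free slot.
        if not self._slots.acquire(timeout=self._timeout):
            raise PoolExhausted("no free slot within timeout")
        try:
            yield
        finally:
            self._slots.release()


pool = BoundedPool(size=2, acquire_timeout=0.01)
with pool.slot(), pool.slot():      # both slots held by slow requests
    try:
        with pool.slot():           # e.g. a health check arriving now
            pass
    except PoolExhausted:
        print("health check could not get a slot")
```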

Restarting and re-scaling the pods provided only temporary improvement, as the underlying backpressure from the saturated queue immediately re-exhausted the available resources. The pods stabilized fully only after the queue cleared in the early afternoon of May 4.

The pod-side resource exhaustion is being treated as a distinct failure mode with its own contributing factors under active analysis. The connection between the queue backpressure and the pod-side exhaustion is confirmed.

Resolution

The following actions were taken to stabilize the system:

Managed Kubernetes was placed in read-only maintenance mode to halt inbound snapshot and storage operations
Scheduled MKS maintenance jobs were disabled for the evening of May 4
MKS cluster auto-healing was temporarily disabled to prevent failed clusters from re-submitting operations to the saturated queue
Cloud API pods were restarted and re-scaled in a recovery attempt; the pods recovered fully once queue load dropped
The queue drained and normal operation was restored in the afternoon of May 6, once the immediate measures had taken effect.

What we are doing to prevent recurrence

We have identified structural weaknesses in the provisioning queue processor, the Managed Kubernetes maintenance scheduling, and the API pod resource handling that contributed to this incident. The following measures address both the immediate defects and the underlying architectural risk, and complement the ongoing efforts to improve both the provisioning system and the Managed Kubernetes service.

Already completed

Signal deduplication: All incoming signals now flow through the deduplication layer, reducing the rate at which the processor can be flooded.
Queue processor cost optimization: The message handler has been modified to avoid the operations whose cost scaled with queue depth and made the feedback loop possible.
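As a rough illustration of signal deduplication, the sketch below coalesces repeated signals for the same key into a single pending entry, so a flood of identical signals occupies one slot. The class `CoalescingSignalQueue` and the choice of key are assumptions, not the deployed design.

```python
from collections import OrderedDict


class CoalescingSignalQueue:
    """Keep at most one pending entry per key; repeated signals are merged."""

    def __init__(self):
        self._pending = OrderedDict()  # key -> number of signals merged

    def put(self, key):
        """Return True if the key was newly queued, False if it was merged."""
        if key in self._pending:
            self._pending[key] += 1
            return False
        self._pending[key] = 1
        return True

    def get(self):
        """Pop the oldest pending key and the number of signals it absorbed."""
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)


signals = CoalescingSignalQueue()
for _ in range(1000):
    signals.put("datacenter-42")  # hypothetical key
print(len(signals))  # 1
```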

Short-term (ETA July 2026)

Queue management tooling: We are building additional operational tooling to allow the provisioning queue to be inspected, paused, or selectively drained during incidents. This work started in Q1, but the toolset still needs to be extended and further automated. Queue switchover was recently tested successfully.
MKS maintenance scheduling: We are introducing jitter and spread mechanisms for MKS maintenance schedules to prevent simultaneous snapshot cleanup across multiple contracts from converging on the shared processing lane.
Ingress deduplication: We are implementing request deduplication at the API ingress layer to prevent legitimate client retries from amplifying load during periods of queue congestion.
API pod resource isolation: We are investigating the conditions that led to connection pool and thread pool exhaustion in the API pods under queue backpressure, with the goal of ensuring that resource contention in the provisioning queue cannot cascade into pod-level health failures.
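The scheduling spread mentioned above could take many forms. One simple option is a stable per-contract offset derived from a hash, so that each contract always starts at the same point in the window but different contracts are spread across it. This is a sketch under assumptions: the function `maintenance_start`, the contract identifiers, and the four-hour window are illustrative only.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def maintenance_start(contract_id: str, window_start: datetime,
                      window: timedelta) -> datetime:
    """Return a stable start time for this contract inside the window.

    The offset is derived from a hash of the contract id, so it is
    deterministic per contract and roughly uniform across contracts.
    Assumes the window is at least one second long.
    """
    window_seconds = int(window.total_seconds())
    digest = hashlib.sha256(contract_id.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:8], "big") % window_seconds
    return window_start + timedelta(seconds=offset)


window_start = datetime(2026, 5, 3, 1, 0, tzinfo=timezone.utc)
for contract in ("contract-1", "contract-2", "contract-3"):
    print(contract, maintenance_start(contract, window_start, timedelta(hours=4)))
```

A hash-based offset keeps each contract's schedule predictable. A random offset drawn per run would spread load equally well but would make start times vary from night to night.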

Mid-term (ETA Q3 2026)

Worker sharding: We are introducing worker sharding for workloads from unrelated contracts and regions, so that a concentration of maintenance work in one lane cannot affect provisioning for other customers.
Queue processor resilience: We are implementing architectural changes to the queue processor, including a bounded message queue and global signal coalescing, to prevent the processor from being overloaded regardless of inflow rate.
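A minimal sketch of how sharding and bounded queues can work together follows. It assumes operations are routed by a hash of region and contract, and that each shard rejects new work once it reaches a fixed depth. `shard_for`, `ShardedQueues`, the shard count, and the depth limit are all hypothetical.

```python
import hashlib
import queue


def shard_for(contract_id: str, region: str, num_shards: int) -> int:
    """Map a contract in a region to a stable shard number."""
    key = f"{region}/{contract_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedQueues:
    """One bounded queue per shard; a full shard rejects instead of growing."""

    def __init__(self, num_shards: int, max_depth: int):
        self._shards = [queue.Queue(maxsize=max_depth) for _ in range(num_shards)]

    def submit(self, contract_id: str, region: str, operation) -> bool:
        """Return True if accepted, False if the target shard is full."""
        index = shard_for(contract_id, region, len(self._shards))
        try:
            self._shards[index].put_nowait(operation)
            return True
        except queue.Full:
            return False

    def depth(self, index: int) -> int:
        return self._shards[index].qsize()


queues = ShardedQueues(num_shards=4, max_depth=3)
accepted = [queues.submit("contract-a", "de/fra", f"op{i}") for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

In this sketch a burst from one contract fills only its own shard and is then rejected, while the other shards stay empty and available.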

Long-term

Provisioning system re-architecture: The provisioning system's current architecture was designed for a different scale and traffic pattern than what IONOS Cloud operates at today. We are actively pursuing a full re-design that will eliminate this class of scalability constraint.

Closing remarks

We recognize that a multi-day degradation of provisioning availability and service quality is a significant impact. Customers relying on infrastructure operations during this period experienced delays, timeouts, and in some cases temporary unavailability of Managed Kubernetes cluster components. The recurring and extended nature of the incident - partially resolving, then worsening again - and the compounding effect of the API pod restarts made the overall impact particularly disruptive and difficult to reliably resolve.

We have been addressing the structural weaknesses in the provisioning queue processor that were the root cause of this incident for some time already. The specific combination of conditions between May 3 and 6 brought additional factors to the surface and highlighted what we still need to address while work on the architecture continues.

Thank you for your patience during the incident and during the time it has taken to complete this analysis.

Posted Jun 05, 2026 - 17:03 UTC

Resolved

The incident is over; provisioning is working normally again.
We are working on the RCA and will publish it here as soon as possible.

Posted May 06, 2026 - 05:55 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 04, 2026 - 15:33 UTC

Investigating

As part of our ongoing work to resolve the provisioning issue, we had to set the Managed K8s service to "read-only".
This means that no cluster, node pool, or autoscaling actions are possible at the moment; running workloads are not affected.
We will let you know as soon as the K8s system is back.

Posted May 04, 2026 - 13:04 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 04, 2026 - 12:00 UTC

Update

We are continuing to work on a fix for this issue.

Posted May 04, 2026 - 11:42 UTC

Update

Provisioning has been re-enabled; however, we are still working on the overall resolution.

Posted May 04, 2026 - 11:26 UTC

Update

We temporarily deactivated provisioning in order to fix the underlying issue.

Posted May 04, 2026 - 10:18 UTC

Update

We are continuing to work on a fix for this issue.

Posted May 04, 2026 - 09:53 UTC

Update

We are continuing to work on a fix for this issue.

Posted May 04, 2026 - 09:22 UTC

Update

We are continuing to work on a fix for this issue.

Posted May 04, 2026 - 08:45 UTC

Identified

Currently, there is an increased processing time for provisioning orders initiated via Data Center Designer or API.

Customers may experience timeouts or errors in some provisioning-related operations.

Occasionally, connections to the Data Center Designer may be interrupted.

Posted May 04, 2026 - 04:10 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 03, 2026 - 21:22 UTC

Identified

Currently, there is an increased processing time for provisioning orders that are initiated via Data Center Designer or API.

Occasionally, connections to the Data Center Designer may be lost.

Availability and accessibility of your virtual data center resources will remain unaffected.
We will inform you as soon as the functionality has been restored.

Posted May 03, 2026 - 18:53 UTC

This incident affected: APIs and Frontends (Data Center Designer (DCD), Cloud API), Location FR/PAR (Managed Kubernetes, Provisioning), Location US/MCI (Managed Kubernetes, Provisioning), Location DE/TXL (Managed Kubernetes, Provisioning), Location DE/FRA (Managed Kubernetes, Provisioning), Location DE/FKB (Managed Kubernetes, Provisioning), Location GB/LHR (Managed Kubernetes, Provisioning), Location US/EWR (Managed Kubernetes, Provisioning), Location US/LAS (Managed Kubernetes, Provisioning), Location ES/VIT (Managed Kubernetes, Provisioning), Location GB/BHX (Managed Kubernetes), and Location DE/FRA/2 (Managed Kubernetes, Provisioning).