In the following Root Cause Analysis, we explain the provisioning degradation that affected IONOS Cloud customers between May 3 and May 6, 2026, identify the technical causes, and outline the measures we are taking to prevent a recurrence.
What happened?
Between May 3 04:07 UTC and May 6 approximately 15:00 UTC, the IONOS Cloud provisioning system experienced repeated degradation across all locations, making infrastructure provisioning requests slow or unavailable for approximately 36 hours. Managed Kubernetes (MKS) clusters were placed in read-only maintenance mode for several hours on May 4, preventing cluster operations during that period. This led to customer-facing service degradation and in some cases to availability disruptions due to delayed MKS operations.
The incident involved two distinct but related failure modes that overlapped on May 4:
To understand the causes, it is helpful to provide context on how the IONOS Cloud provisioning system processes infrastructure changes.
All infrastructure operations - whether initiated through the DCD, the Cloud API, or automated platform maintenance - are submitted to a provisioning queue. The queue processes these operations in order, acquiring a lock on the relevant infrastructure components to ensure consistency. This serialization is necessary to maintain the integrity of customer infrastructure across all locations. A dedicated queue processor is responsible for working through queued items. Under normal operating conditions, this architecture performs reliably. However, it contains a structural weakness that this incident exposed: the cost of processing each incoming queue signal grows with the current queue depth. Under sustained high load, this creates a feedback loop - the deeper the queue, the slower the processor, and the slower the processor, the deeper the queue grows.
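The feedback loop described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of the production system: the function name `simulate`, the per-tick time budget, and the cost parameters are all hypothetical values chosen for illustration. The only property taken from the description is that the cost of handling one item grows with the current queue depth.

```python
def simulate(arrivals_per_tick, ticks, start_depth=0,
             base_cost=1.0, cost_per_item=0.01, budget=100.0):
    """Return the queue depth after each tick (all parameters hypothetical).

    Each tick the processor has a fixed time budget. Handling one item
    costs base_cost + cost_per_item * depth, so throughput falls as the
    backlog grows.
    """
    depth = start_depth
    history = []
    for _ in range(ticks):
        depth += arrivals_per_tick
        cost = base_cost + cost_per_item * depth
        # Number of items that fit into this tick's time budget.
        processed = min(depth, int(budget // cost))
        depth -= processed
        history.append(depth)
    return history


# Same arrival rate, different starting backlog.
healthy = simulate(arrivals_per_tick=40, ticks=50)
saturated = simulate(arrivals_per_tick=40, ticks=50, start_depth=1000)
```

With these illustrative numbers, the same inflow that is drained completely from an empty queue can no longer be drained once a large backlog exists, so the depth keeps growing until the inflow drops.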
How was this possible?
IONOS Cloud's Managed Kubernetes service performs regular automated maintenance on customer clusters, including snapshot cleanup operations. These snapshot operations are handled through a platform-wide processing lane, as snapshots are not bound to individual customer virtual datacenters at the storage layer. This means that snapshot work from all MKS customers - regardless of which contract or cluster it originates from - is serialized through the same queue lane.
In the early hours of May 3, a large number of customer contracts ran snapshot cleanup jobs in the same overnight window, each contributing a substantial batch of delete operations to this shared processing lane. The combined workload was sufficient to push the queue processor past the threshold at which the feedback loop becomes self-sustaining: the processor could no longer work through the baseline load of provisioning jobs fast enough to reduce the queue depth, so provisioning jobs accumulated, which in turn added further processing load.
As the queue depth grew, the processor began to fall behind. Managed Kubernetes controllers, observing slow API responses, legitimately retried their requests - amplifying the inflow severalfold at peak load.
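The amplification effect of retries can be estimated with a short calculation. The sketch below is illustrative only: the function name, the retry limit, and the timeout probabilities are assumptions, not measured values from the incident or the actual retry policy of the Managed Kubernetes controllers.

```python
def effective_inflow(base_rate, timeout_fraction, max_retries):
    """Expected request rate including retries (hypothetical model).

    Assumes each attempt times out independently with probability
    timeout_fraction and that a timed-out attempt is retried up to
    max_retries times.
    """
    # Attempt k is only made if the k previous attempts all timed out.
    attempts = sum(timeout_fraction ** k for k in range(max_retries + 1))
    return base_rate * attempts


normal = effective_inflow(base_rate=100, timeout_fraction=0.0, max_retries=3)
degraded = effective_inflow(base_rate=100, timeout_fraction=0.5, max_retries=3)
saturated = effective_inflow(base_rate=100, timeout_fraction=1.0, max_retries=3)
```

Under this model the inflow multiplier approaches `max_retries + 1` as more responses time out, which is how well-behaved clients can still multiply the load on an already slow system.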
The queue feedback loop
The queue message handler performs a lookup of pending items per datacenter each time a processing signal is received. At the queue depth reached during the incident, this operation was orders of magnitude more expensive than under normal conditions. Concurrently, certain internal completion signals bypassed the deduplication logic that normally moderates message inflow, adding load to the processor's message queue directly. Both factors compound: the cost of each handling step grows with the backlog, and the rate of incoming signals also grows with the backlog. Past a threshold, the processor can no longer drain its workload faster than new signals arrive.
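The two factors named above can be made concrete with a small sketch. It is not the production handler: the class names, method names, and data structures are hypothetical and only contrast a lookup whose cost grows with the total backlog against one that does not, together with signal deduplication that allows at most one pending signal per datacenter.

```python
from collections import defaultdict, deque


class ScanningQueue:
    """All pending items in one list; every signal scans the whole list."""

    def __init__(self):
        self.items = []  # (datacenter_id, operation) pairs

    def submit(self, dc, op):
        self.items.append((dc, op))

    def pending_for(self, dc):
        # Cost is proportional to the total queue depth, not to the
        # number of items that belong to this datacenter.
        return [op for d, op in self.items if d == dc]


class IndexedQueue:
    """Pending items grouped per datacenter, with deduplicated signals."""

    def __init__(self):
        self.by_dc = defaultdict(deque)
        self.signals = deque()
        self.signalled = set()  # datacenters that already have a signal queued

    def submit(self, dc, op):
        self.by_dc[dc].append(op)
        self.signal(dc)

    def signal(self, dc):
        # Deduplication: a second signal for the same datacenter is dropped.
        if dc not in self.signalled:
            self.signalled.add(dc)
            self.signals.append(dc)

    def pending_for(self, dc):
        # Cost depends only on this datacenter's own backlog.
        return list(self.by_dc.get(dc, ()))
```

In the scanning variant, any signal path that skips deduplication directly multiplies the number of full scans, which is the compounding effect described above.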
The first phase on May 3 partially self-resolved when the MKS maintenance batch completed, reducing inflow and allowing the feedback loop to unwind naturally. The second phase on May 4 did not self-resolve, because MKS cluster auto-healing continued generating new snapshot and storage operations against the same already-saturated queue lane.
API pod resource exhaustion
From mid-morning UTC on May 4, Cloud API pods began failing their health checks and entering a restart cycle. The sustained queue backpressure caused API request-handling threads and database connections to be held open for extended periods while waiting for the queue to accept and process their operations. This depleted the connection pool and thread pool resources available to the pods, resulting in failed health checks.
Restarting and re-scaling the pods provided only temporary improvement, as the underlying backpressure from the saturated queue immediately re-exhausted the available resources. The pods stabilized fully only after the queue cleared in the early afternoon of May 4.
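The exhaustion pattern can be reproduced in miniature. The following is a hedged sketch, not the Cloud API implementation: `ConnectionPool`, `health_check`, and `hold_connections` are hypothetical names, and the pool is modelled as a simple counting semaphore. It shows how requests that hold a connection while waiting on a saturated queue leave nothing for the health check.

```python
import threading


class ConnectionPool:
    """Fixed-size pool modelled as a counting semaphore (illustrative)."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    def try_acquire(self):
        # Non-blocking: returns False immediately if no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def health_check(pool):
    """Report healthy only if a connection is available right now."""
    if pool.try_acquire():
        pool.release()
        return True
    return False


def hold_connections(pool, n):
    """Simulate n requests that take a connection and keep holding it
    while blocked on the queue. Returns how many obtained one."""
    return sum(1 for _ in range(n) if pool.try_acquire())


pool = ConnectionPool(size=4)
held = hold_connections(pool, 6)   # more blocked requests than slots
healthy = health_check(pool)       # no free slot is left for the check
```

A restart corresponds to creating a fresh pool. As long as the blocked requests return at the same rate, the new pool fills up in the same way, which matches the temporary improvement observed after each restart.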
The connection between the queue backpressure and the pod-side resource exhaustion is confirmed. The pod-side exhaustion itself is being treated as a distinct failure mode, and its own contributing factors are still under active analysis.
Resolution
The following actions were taken to stabilize the system:
What we are doing to prevent recurrence
We have identified structural weaknesses in the provisioning queue processor, the Managed Kubernetes maintenance scheduling, and the API pod resource handling that contributed to this incident. The following measures address both the immediate defects and the underlying architectural risk, and complement the ongoing efforts to improve both the provisioning system and the Managed Kubernetes service.
Already completed
Short-term (ETA July 2026)
Mid-term (ETA Q3 2026)
Long-term
Closing remarks
We recognize that a multi-day degradation of provisioning availability and service quality is a significant impact. Customers relying on infrastructure operations during this period experienced delays, timeouts, and in some cases temporary unavailability of Managed Kubernetes cluster components. The recurring and extended nature of the incident - partially resolving, then worsening again - and the compounding effect of the API pod restarts made the overall impact particularly disruptive and the incident difficult to resolve reliably.
We have been working for some time to address the structural weaknesses in the provisioning queue processor that were the root cause of this incident. The specific combination of conditions between May 3 and May 6 brought additional factors to the surface and highlighted what we still need to address while work on the architecture continues.
Thank you for your patience during the incident and during the time it has taken to complete this analysis.