K8s Control Planes not available

Incident Report for IONOS Cloud

Postmortem

What happened?

On May 8, 2026, between 15:22 and 16:12 (UTC+2), a routine internal operation to onboard a new Managed Kubernetes development cluster caused the deployment automation system to unintentionally target and modify additional clusters, installing applications on them that were meant only for the development cluster.

The applications deployed to the affected clusters could not be executed: the required image pull secrets were not present in those environments, so no images could be pulled or run. Once the full scope of the incident was established, a recovery script was developed, validated against our staging environment, and executed in production. All unintended changes were identified and removed by May 15, 2026.

How was this possible? (Root Cause)

The incident was caused by a software bug in our platform framework onboarding operator, compounded by the absence of a configuration completeness gate in the deployment pipeline.

The onboarding operator is responsible for registering clusters with our deployment system. It relies on a scoping configuration file to restrict its operations to a defined set of target clusters. As part of the planned onboarding of the new development cluster, this configuration file was to be automatically propagated to the deployment branch. The propagation failed silently due to a merge conflict, and the file never landed on the target branch.

Simultaneously, a second change removed the onboarding operator from the exclusion list, triggering an immediate deployment. Because the deployment system was not configured to halt the sync when a required configuration was missing, the operator was deployed without any scoping restriction. In this unconstrained state, it treated other cluster-api based clusters as valid onboarding targets.
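
The difference between the failure mode described above and a fail-closed alternative can be sketched in a few lines of Python. This is an illustration only, not the operator's actual code: the function names, the cluster names, and the modelling of the scoping configuration as an optional set of cluster names are all assumptions made for the example.

```python
from typing import Iterable, List, Optional, Set


class MissingScopeError(RuntimeError):
    """Raised when no scoping configuration is available."""


def select_targets_fail_open(
    clusters: Iterable[str], scope: Optional[Set[str]]
) -> List[str]:
    # Hypothetical model of the faulty behavior: a missing scope (None)
    # is silently interpreted as "no restriction".
    if scope is None:
        return list(clusters)
    return [c for c in clusters if c in scope]


def select_targets_fail_closed(
    clusters: Iterable[str], scope: Optional[Set[str]]
) -> List[str]:
    # Hypothetical model of a fail-closed design: a missing or empty
    # scope is an error, so no cluster is onboarded at all.
    if not scope:
        raise MissingScopeError("no scoping configuration; refusing to onboard")
    return [c for c in clusters if c in scope]
```

With a missing scope, the first function returns every cluster it is given, while the second refuses to proceed. Treating "no configuration" as an error rather than as "no restriction" is the design choice that keeps a lost file from widening the operator's reach.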

The root cause is a structural gap in the CI/CD pipeline: the mechanisms to verify that all required configuration files had been successfully delivered to the target branch before a deployment was permitted proved inadequate. The cherry-pick failure was logged but did not block the pipeline. This made it possible to deploy a fully functional but unscoped operator under conditions that appeared normal to the team monitoring the rollout.
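
A step failure that is logged but does not block is, in effect, ignored. The following Python sketch shows the opposite policy under simplified assumptions: steps run in order, and the first non-zero exit status stops everything after it. The names `PipelineBlocked`, `run_blocking_steps`, and `EXAMPLE_STEPS` are hypothetical, and the example commands merely simulate a propagation step that fails; in a real pipeline that step might be a `git cherry-pick`, which exits non-zero on a conflict.

```python
import subprocess
import sys
from typing import Sequence


class PipelineBlocked(RuntimeError):
    """Raised when a step fails and later steps must not run."""


def run_blocking_steps(steps: Sequence[Sequence[str]]) -> int:
    """Run steps in order and return how many completed.

    Raises PipelineBlocked on the first non-zero exit status, so no
    later step (such as a deployment) is started.
    """
    completed = 0
    for cmd in steps:
        result = subprocess.run(list(cmd))
        if result.returncode != 0:
            raise PipelineBlocked(
                f"step {list(cmd)!r} exited with status {result.returncode}"
            )
        completed += 1
    return completed


# Simulated pipeline: the second step stands in for a failed propagation.
EXAMPLE_STEPS = [
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "raise SystemExit(1)"],
    [sys.executable, "-c", "pass"],  # stands in for the deployment
]
```

Calling `run_blocking_steps(EXAMPLE_STEPS)` raises `PipelineBlocked` at the second step, and the third step is never started.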

What are we doing to prevent recurrence?

Upon detection, the two immediate technical failures were resolved. The onboarding operator has been corrected and the deployment system configuration has been hardened. In parallel, the recovery effort - which included a full staging test cycle before any production changes - identified and cleaned up all affected clusters within the week following the incident.

Immediate Technical Actions

Operator Scope Restriction: The onboarding operator logic has been corrected to require explicit namespace configuration before it can act on any clusters. It can no longer run unscoped against cluster-api based clusters by default. (DONE)
Deployment Configuration Safety: The deployment system has been reconfigured to fail deployments when required configuration files are absent, replacing the previous behavior that allowed deployments to proceed silently with an incomplete configuration. (DONE)
Cluster Cleanup: A recovery script was developed, validated in staging, and executed across all affected production clusters. All unintended installations and updates have been removed from all affected clusters. (DONE, completed May 15, 2026)

Structural Improvements

Pipeline Configuration Gate: We are implementing a mandatory verification step in the CI/CD pipeline to confirm that all required configuration files have been successfully propagated to the target branch before any deployment is triggered. (ETA: Q3 2026)
Change Management Controls: We are enforcing additional mandatory peer review and change management approvals for all deployments that touch production systems, regardless of change size. This closes the process gap that allowed this change to proceed without the required additional oversight, and it applies in addition to the already established Change Approval Board review process. (ETA: Q3 2026)
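
One possible shape for the Pipeline Configuration Gate described above is a check that every required file exists on the target branch before a deployment is allowed to start. The Python sketch below is an assumption about how such a gate could look, not a description of the gate being built: the function names, branch name, and file names are hypothetical. It relies on `git cat-file -e <rev>:<path>`, which exits with status zero only if the named object exists.

```python
import subprocess
from typing import Callable, Iterable, List


def exists_on_branch(branch: str, path: str) -> bool:
    # `git cat-file -e` exits 0 only if the object <branch>:<path> exists.
    result = subprocess.run(
        ["git", "cat-file", "-e", f"{branch}:{path}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_files(
    branch: str,
    required: Iterable[str],
    exists: Callable[[str, str], bool] = exists_on_branch,
) -> List[str]:
    # Return the required paths that are absent from the branch.
    return [path for path in required if not exists(branch, path)]


def gate(
    branch: str,
    required: Iterable[str],
    exists: Callable[[str, str], bool] = exists_on_branch,
) -> None:
    # Exit with an error message (non-zero status) if anything is
    # missing, so the pipeline stops before the deployment is triggered.
    missing = missing_files(branch, required, exists)
    if missing:
        raise SystemExit(
            f"blocking deployment: missing on {branch}: {', '.join(missing)}"
        )
```

The `exists` parameter is injectable so that the check can be exercised without a repository. In a pipeline, the default would be used and `gate("deploy", ["scope.yaml"])` would be run from inside the repository checkout as the last step before the deployment.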

Procedural Improvements

Affected customers were not kept properly informed about the cleanup efforts, which compounded the uncertainty. Improvements to incident communication are currently being integrated into the wider incident management process, so that tech teams have scalable options for informing individually affected customers in a timely manner about the status of post-incident remediation efforts. (ETA: Q3 2026)

Closing remarks

We recognize that unintended modification of cluster configurations represents a failure in the controls that should prevent our internal operations from influencing customer environments.

The fixes now in place, together with those planned, will close the immediate technical and procedural shortcomings. The structural measures will further harden the Change Process, reducing both the likelihood and the potential fallout of maintenance-related incidents.

Thank you for your patience during the incident and the remediation efforts.

Posted Jun 15, 2026 - 09:32 UTC

Resolved

We are marking this incident as resolved. An RCA will be published here once it is compiled.

Posted May 11, 2026 - 08:30 UTC

Monitoring

The rollback was completed successfully. The team is currently checking the affected resources for potential residual inconsistencies. We are moving this incident to monitoring status.

Posted May 08, 2026 - 20:42 UTC

Update

We are currently rolling out the correction. We expect this to take several minutes and will update this page once the rollout is completed.

Posted May 08, 2026 - 18:42 UTC

Update

The team is now preparing to safely roll back the identified changes on affected resources.

Posted May 08, 2026 - 17:13 UTC

Update

The team continues to audit affected resources to verify that all remaining inadvertent configuration changes are reverted.

Posted May 08, 2026 - 16:29 UTC

Identified

Our Kubernetes engineering team has identified the root cause of the issue. The incident stemmed from a defect in a configuration filtering mechanism. We are currently implementing mitigation steps and will provide further updates on our progress here.

Posted May 08, 2026 - 15:52 UTC

Investigating

Some customers may experience problems connecting to the control plane and degraded Kubernetes functionality.
Our engineers are working on a fix.

Posted May 08, 2026 - 15:15 UTC

This incident affected: Location DE/TXL (Managed Kubernetes), Location DE/FRA (Managed Kubernetes), and Location DE/FRA/2 (Managed Kubernetes).