What happened?
On May 8, 2026 between 15:22 and 16:12 (UTC+2) a routine internal operation to onboard a new Managed Kubernetes development cluster caused the deployment automation system to unintentionally target and modify additional clusters and install applications meant for the development cluster.
The applications deployed to affected clusters could not be executed, as the required image pull secrets were not present in those environments, preventing any images from being pulled or run. Once the full scope of the incident was established, a recovery script was developed, validated against our staging environment, and executed in production. All unintended changes have been identified and removed by May 15, 2026.
How was this possible? (Root Cause)
The incident was caused by a software bug in our platform framework onboarding operator, compounded by the absence of a configuration completeness gate in the deployment pipeline.
The onboarding operator is responsible for registering clusters with our deployment system. It relies on a scoping configuration file to restrict its operations to a defined set of target clusters. As part of the planned onboarding of the new development cluster, this configuration file was to be automatically propagated to the deployment branch. The propagation failed silently due to a merge conflict - the file never landed on the target branch.
Simultaneously, a second change removed the onboarding operator from the exclusion list, triggering an immediate deployment. As the deployment system was not configured properly to react to missing configurations - rather than halting the sync - the operator was deployed without any scoping restriction. In this unconstrained state, it treated further cluster-api based clusters as valid onboarding targets.
The root cause is a structural gap in the CI/CD pipeline: the mechanisms to verify that all required configuration files had been successfully delivered to the target branch before a deployment was permitted proved inadequate. The cherry-pick failure was logged but did not block the pipeline. This made it possible to deploy a fully functional but unscoped operator under conditions that appeared normal to the team monitoring the rollout.
What are we doing to prevent recurrence?
Upon detection, the two immediate technical failures were resolved. The onboarding operator has been corrected and the deployment system configuration has been hardened. In parallel, the recovery effort - which included a full staging test cycle before any production changes - identified and cleaned up all affected clusters within the week following the incident.
Immediate Technical Actions
Structural Improvements:
Procedural Improvements
Affected customers were not properly kept informed about the cleanup efforts which compounded the uncertainty. Improvements to the incident communication process are currently being implemented into the wider incident management process ensuring that scalable options exist for tech teams to inform individually affected customers in a timely manner about the status of post-incident remediation efforts. (ETA: Q3 2026)
Closing remarks
We recognize that unintended modification of cluster configurations represents a failure in the controls that should prevent our internal operations from influencing customer environments.
The fixes now in place and planned will close the immediate technical and procedural shortcomings. The structural measures will further harden the Change Process reducing the likelihood and potential fallout of maintenance related incidents further.
Thank you for your patience during the incident and the remediation efforts.