MK8s: Partial Connectivity Degradation to Control Planes

Incident Report for IONOS Cloud

Postmortem

What Happened?

On March 14, 2026, a subset of customer Managed Kubernetes control planes experienced recurring periods of intermittent unavailability. The incident persisted until March 16, 2026, when the last anomalies were recorded and the issue was fully mitigated.

During the impact window, affected clusters were periodically unreachable, meaning operations that depend on the Kubernetes API - such as deployments, scaling operations, and health checks - did not work reliably.

How Was This Possible?

The root cause was a combination of three factors that compounded each other:

  • Excessive data volume from a small number of clusters: A subset of clusters was storing unusually large amounts of data in the shared control plane database - primarily security scanning reports and policy audit records, alongside high volumes of event and autoscaling objects. Three clusters alone accounted for approximately 64% of all stored data on one of the affected database instances, putting significant pressure on shared resources.
  • Database maintenance frequency: Our control plane databases were configured to compact and reclaim unused space at a fixed interval. Given the high rate of data being written by the affected clusters, this interval was insufficient to keep up, causing the database to accumulate excessive historical revisions and grow well beyond its normal operating size.
  • Database fragmentation: As the database grew, the actual allocated size exceeded the amount of data actively in use - adding roughly 40% overhead. This pushed the database toward its hard storage limit, at which point it would have become read-only and caused a full control plane failure for all clusters on that instance.

Together, these factors caused periodic stalls during database maintenance cycles, which temporarily disrupted the Kubernetes API server and made affected control planes unavailable for several minutes each time.
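To make the fragmentation factor concrete: etcd itself reports both the allocated backend size and the portion actually holding live data, and the gap between the two is the overhead described above. The following minimal sketch, using the official etcd Go client (go.etcd.io/etcd/client/v3), shows how that ratio can be read; it is illustrative rather than our production tooling, and the endpoint address is a placeholder.

```go
// fragcheck: read etcd's allocated vs. in-use backend size to estimate
// fragmentation overhead. Endpoint address is a placeholder.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd.example.internal:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Status reports the allocated backend size (DbSize) and the portion
	// holding live data (DbSizeInUse, available since etcd 3.4).
	status, err := cli.Status(ctx, cli.Endpoints()[0])
	if err != nil {
		log.Fatal(err)
	}
	overhead := float64(status.DbSize-status.DbSizeInUse) / float64(status.DbSizeInUse)
	fmt.Printf("allocated: %d B, in use: %d B, fragmentation overhead: %.0f%%\n",
		status.DbSize, status.DbSizeInUse, overhead*100)
}
```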

What Are We Doing to Prevent Recurrence?

We have already taken immediate action and have a structured plan in place for further improvements:

  • Performed defragmentation on the affected database instances, reclaiming approximately 3–4 GiB per instance and restoring healthy operating headroom (DONE)
  • Reduced the database compaction intervals on the affected instances (DONE)
  • Expanded monitoring dashboards to improve visibility into control plane health per database instance (DONE)
  • Migrated the highest-volume clusters to dedicated database backends, eliminating the noisy-neighbor risk for those workloads entirely (DONE)
  • Improved alerting so that incidents of this nature trigger immediate paging notifications and timely human intervention (DONE)
  • Rolled out the updated compaction interval across all shared control plane database instances (DONE)
  • Implemented automated, scheduled defragmentation jobs for all instances to prevent fragmentation from building up; a simplified sketch of such a job follows this list (DONE)
  • Rebalanced cluster distribution across database instances to reduce concentration risk (DONE)
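For context, the sketch below outlines what an automated defragmentation pass can look like when built on the official etcd Go client. It is a minimal illustration, not our production job: the endpoint and the 40% overhead threshold (chosen to mirror the fragmentation figure above) are placeholders.

```go
// A simplified sketch of a scheduled defragmentation pass using the
// official etcd Go client; threshold and endpoints are illustrative.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragIfFragmented defragments an endpoint when the allocated size exceeds
// the live data size by more than maxOverhead (e.g. 0.4 for 40%).
func defragIfFragmented(ctx context.Context, cli *clientv3.Client, endpoint string, maxOverhead float64) error {
	status, err := cli.Status(ctx, endpoint)
	if err != nil {
		return err
	}
	overhead := float64(status.DbSize-status.DbSizeInUse) / float64(status.DbSizeInUse)
	if overhead <= maxOverhead {
		return nil // healthy, nothing to reclaim
	}
	log.Printf("%s: %.0f%% overhead, defragmenting", endpoint, overhead*100)
	// Defragment rewrites the backend file and briefly blocks the member,
	// so it should run per endpoint, one member at a time.
	_, err = cli.Defragment(ctx, endpoint)
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1.example.internal:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for _, ep := range cli.Endpoints() {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		if err := defragIfFragmented(ctx, cli, ep, 0.4); err != nil {
			log.Printf("%s: defrag failed: %v", ep, err)
		}
		cancel()
	}
}
```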

Short-term (within 4 weeks):

  • Reaching out directly to customers whose clusters are generating disproportionately high data volumes to discuss workload optimization, such as using external storage backends for large security scan reports instead of storing them in the control plane database (ONGOING)

Medium-term (1–3 months):

  • Specification of quotas: Introducing per-tenant resource quotas and key-count limits on the shared control plane database to prevent any single cluster from impacting others
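As an illustration of what a key-count check against the shared database could look like, the sketch below counts the keys under a per-tenant prefix with the official etcd Go client. The prefix scheme, endpoint, and limit are hypothetical and do not reflect our actual quota design.

```go
// A hedged sketch of a per-tenant key-count check against etcd, assuming
// each cluster's objects live under a known key prefix.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd.example.internal:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	const (
		tenantPrefix = "/registry/clusters/tenant-a/" // hypothetical per-tenant prefix
		keyLimit     = 100_000                        // illustrative quota
	)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// WithCountOnly returns only the number of matching keys, so the check
	// itself puts no read pressure on the value payloads.
	resp, err := cli.Get(ctx, tenantPrefix, clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		log.Fatal(err)
	}
	if resp.Count > keyLimit {
		fmt.Printf("tenant over quota: %d keys (limit %d)\n", resp.Count, keyLimit)
	}
}
```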

Long-term (Q2 2026 and beyond):

  • Re-architecting the shared control plane to provide stronger isolation between customer workloads, including dedicated database instances and improved vertical scaling
  • Evaluating full database sharding for customers with consistently high data volumes
  • Conducting regular operational drills focused on database resource exhaustion and recovery to improve our response time for future incidents
Posted Apr 23, 2026 - 14:10 UTC

Resolved

This incident is now marked resolved, as all affected control planes have returned to a stable state and remained there.
A Root Cause Analysis (RCA) is underway and will be published here once finalized.
Posted Mar 17, 2026 - 17:04 UTC

Monitoring

Final changes have been applied to all affected clusters, resolving the issue. We are now monitoring progress.
Posted Mar 16, 2026 - 19:59 UTC

Update

Final changes have been applied to all affected clusters, resolving the issue. We are now monitoring progress.
Posted Mar 16, 2026 - 19:58 UTC

Update

The changes applied to some control plane clusters have had positive effects. The team is continuing the rollout to other affected clusters.
Posted Mar 16, 2026 - 19:34 UTC

Update

We are marking DBaaS as recovered.
Our Kubernetes Team is currently working on stabilizing the Kubernetes Control Plane. We are focusing on mitigating the recurring load spikes that affect stability.
Posted Mar 16, 2026 - 13:56 UTC

Update

We are marking the Container Registry Service as recovered.
Posted Mar 16, 2026 - 12:57 UTC

Update

We are closing the incident for the AI Model Hub. All metrics have recovered and the service is up and running normally again.
Posted Mar 16, 2026 - 12:22 UTC

Update

We are adding the Container Registry as an affected Service. Customers may currently experience issues pulling images from and pushing images to the Registry.
Posted Mar 16, 2026 - 11:57 UTC

Update

Our Kubernetes Team has deployed a fix for the affected AI Model Hub Database Services. We are currently seeing metrics improve and are monitoring the situation closely.
Posted Mar 16, 2026 - 11:37 UTC

Update

We are expanding the scope of this incident to include DBaaS and AI Model Hub. We have observed an increased error count originating from PostgresDB on Kubernetes. Additionally, to improve transparency, the previously reported separate incident regarding the AI Model Hub (https://status.ionos.cloud/incidents/rmgs845klm32) is being merged into this primary incident.
Posted Mar 16, 2026 - 11:05 UTC

Identified

The team has identified the root cause as a resource constraint within the etcd database. Mitigation efforts are currently underway.
Posted Mar 16, 2026 - 10:04 UTC

Investigating

Some customers may experience connection problems to the control plane and degraded Kubernetes functionality.
Our teams are investigating and working on a resolution.
Posted Mar 16, 2026 - 08:24 UTC
This incident affected: Global Services (Database as a Service (DBaaS), AI Model Hub, Container Registry) and Location DE/FRA (Managed Kubernetes).