What Happened?
On March 14, 2026, a subset of customer Managed Kubernetes control planes began experiencing intermittent, recurring periods of unavailability. The impact persisted until March 16, 2026, when the last anomalies were recorded and full mitigation was confirmed.
During the impact window, affected clusters were periodically unreachable, meaning operations that depend on the Kubernetes API - such as deployments, scaling operations, and health checks - did not work reliably.
How Was This Possible?
The root cause was a combination of three factors that compounded each other:
- Excessive data volume from a small number of clusters: A handful of clusters stored unusually large amounts of data in the shared control plane database - primarily security scanning reports and policy audit records, alongside high volumes of event and autoscaling objects. Three clusters alone accounted for approximately 64% of all stored data on one of the affected database instances, putting significant pressure on shared resources.
- Database maintenance frequency: Our control plane databases were configured to compact historical revisions and reclaim unused space at a fixed interval. Given the high rate of data being written by the affected clusters, this interval was too long to keep up, causing the databases to accumulate excessive historical revisions and grow well beyond their normal operating size.
- Database fragmentation: As the database grew, the actual allocated size exceeded the amount of data actively in use - adding roughly 40% overhead. This pushed the database toward its hard storage limit, at which point it would have become read-only and caused a full control plane failure for all clusters on that instance.
Together, these factors caused periodic stalls during database maintenance cycles, which temporarily disrupted the Kubernetes API server and made affected control planes unavailable for several minutes each time.
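For illustration, the roughly 40% overhead figure corresponds to the calculation below. This is a minimal sketch assuming etcd-style reporting, where the database distinguishes the allocated file size ("dbSize") from the data actively in use ("dbSizeInUse"); the byte values are hypothetical, not the actual incident numbers.

```python
def fragmentation_overhead(db_size_bytes: int, db_in_use_bytes: int) -> float:
    """Fraction of allocated space that is fragmentation overhead,
    expressed relative to the data actively in use."""
    return (db_size_bytes - db_in_use_bytes) / db_in_use_bytes

# Hypothetical example: 7 GiB allocated on disk, 5 GiB actively in use.
GIB = 1024 ** 3
print(round(fragmentation_overhead(7 * GIB, 5 * GIB), 2))  # → 0.4, i.e. ~40% overhead
```

Defragmentation rewrites the database file so the allocated size shrinks back toward the in-use size, restoring headroom before the hard storage limit is reached.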
What Are We Doing to Prevent Recurrence?
We have already taken immediate action and have a structured plan in place for further improvements:

Immediate (completed):
- Performed defragmentation on the affected database instances, reclaiming approximately 3–4 GiB per instance and restoring healthy operating headroom (DONE)
- Reduced the database compaction intervals on the affected instances (DONE)
- Expanded monitoring dashboards to improve visibility into control plane health per database instance (DONE)
- Migrated the highest-volume clusters to dedicated database backends, eliminating the noisy-neighbor risk for those workloads entirely (DONE)
- Improved alerting so that incidents of this nature trigger immediate paging notifications, ensuring timely human intervention (DONE)
- Rolled out the updated compaction intervals across all shared control plane database instances (DONE)
- Implemented automated, scheduled defragmentation jobs for all instances to prevent fragmentation from building up (DONE)
- Rebalanced cluster distribution across database instances to reduce concentration risk (DONE)
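The automated defragmentation jobs mentioned above can be sketched roughly as follows. This is an illustrative sketch, not our production implementation: it assumes the control plane database is etcd (as is typical for Kubernetes) and that the `etcdctl` CLI is available, and the 30% threshold is a hypothetical example value.

```python
import json
import subprocess

# Hypothetical threshold: defragment once >=30% of the allocated file is reclaimable.
DEFRAG_THRESHOLD = 0.30

def needs_defrag(status_json: str, threshold: float = DEFRAG_THRESHOLD) -> bool:
    """Decide from `etcdctl endpoint status --write-out=json` output whether
    enough of the endpoint's backing file is reclaimable to warrant a defrag."""
    status = json.loads(status_json)[0]["Status"]
    db_size = status["dbSize"]
    db_in_use = status.get("dbSizeInUse", db_size)  # reported since etcd 3.4
    return (db_size - db_in_use) / db_size >= threshold

def defrag_if_needed(endpoint: str) -> None:
    """Query one endpoint and defragment it only when the threshold is exceeded."""
    out = subprocess.run(
        ["etcdctl", "--endpoints", endpoint, "endpoint", "status", "--write-out=json"],
        capture_output=True, text=True, check=True,
    ).stdout
    if needs_defrag(out):
        subprocess.run(["etcdctl", "--endpoints", endpoint, "defrag"], check=True)
```

Gating the defrag on a reclaimable-space threshold keeps the job from rewriting healthy database files, since defragmentation itself briefly blocks the endpoint it runs against.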
Short-term (within 4 weeks):
- Reaching out directly to customers whose clusters are generating disproportionately high data volumes to discuss workload optimization, such as using external storage backends for large security scan reports instead of storing them in the control plane database (ONGOING)
Medium-term (1–3 months):
- Introducing per-tenant resource quotas and key-count limits on the shared control plane database to prevent any single cluster from impacting others
Long-term (Q2 2026 and beyond):
- Re-architecting the shared control plane to provide stronger isolation between customer workloads, including dedicated database instances and improved vertical scaling
- Evaluating full database sharding for customers with consistently high data volumes
- Conducting regular operational drills focused on database resource exhaustion and recovery to improve our response time for future incidents