Incident Summary
On March 2, 2026, some Managed Kubernetes clusters in our Frankfurt (DE/FRA) region became stuck in an UPDATING or DEPLOYING state. This caused intermittent communication issues between control plane components and worker nodes.
Root Cause
The instability was caused by our control plane management system running out of memory and crashing under high load. When the system restarted, it was unable to properly resume its interrupted tasks, which left several clusters stuck mid-update.
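As an illustration only, the restart behavior described above can be contrasted with a management process that records task state durably and picks up interrupted work when it starts again. The sketch below is not our control plane management system or the upstream fix. `TaskJournal`, `resume_after_restart`, and the `run_update` callback are hypothetical names, and the sketch assumes that cluster updates are idempotent and can safely be re-run.

```python
import json
import os
from pathlib import Path


class TaskJournal:
    """Hypothetical on-disk record of per-cluster task states."""

    def __init__(self, path):
        self.path = Path(path)
        # Load whatever state survived the previous run, if any.
        self.tasks = json.loads(self.path.read_text()) if self.path.exists() else {}

    def _save(self):
        # Write to a temporary file and rename it over the journal, so a
        # crash mid-write cannot leave a truncated journal behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.tasks))
        os.replace(tmp, self.path)

    def start(self, cluster_id):
        self.tasks[cluster_id] = "in_progress"
        self._save()

    def finish(self, cluster_id):
        self.tasks[cluster_id] = "done"
        self._save()

    def interrupted(self):
        # Tasks still marked in_progress were cut off by a crash.
        return sorted(c for c, s in self.tasks.items() if s == "in_progress")


def resume_after_restart(journal, run_update):
    """Re-run every interrupted task; run_update is assumed to be idempotent."""
    resumed = []
    for cluster_id in journal.interrupted():
        run_update(cluster_id)
        journal.finish(cluster_id)
        resumed.append(cluster_id)
    return resumed
```

On startup, such a process would construct `TaskJournal` from the existing file and call `resume_after_restart` before accepting new work, so that clusters interrupted mid-update are completed rather than left in a transitional state.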
Resolution
Our engineering teams stabilized the system by increasing its memory allocation. Once the system was stable, engineers manually recovered the clusters that were still stuck, restoring full service.
Prevention
To prevent a recurrence, we are working to address an upstream software bug in how the system handles memory-related crashes. We are also fixing an internal alerting issue that prevented our on-call team from being notified, so that we can respond much faster should a similar issue arise.