IONOS Container Registry

Incident Report for IONOS Cloud

Postmortem

We want to share the Root Cause Analysis for this incident:

What happened

On 4 March 2026 customers of the IONOS Container Registry experienced 504 Gateway Timeout errors when pushing or pulling container images. Deployments that relied on the registry were blocked.

How was that possible (Root Cause)

The registry runs on IONOS Managed Kubernetes (MK8s) infrastructure. During a temporary capacity constraint, two critical control‑plane components were placed on the same proxy instance instead of being distributed across separate proxies, despite existing anti‑affinity rules. The shared proxy reached its maximum concurrent‑connection limit and stopped accepting new connections. Because all registry traffic to the Kubernetes API traverses this proxy, push and pull operations failed with 504 errors. The capacity‑constrained placement created the co‑location condition; the connection‑limit exhaustion was the direct trigger.
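
For background on how anti‑affinity rules can be in place yet still be bypassed: Kubernetes distinguishes *preferred* (best‑effort) from *required* (hard) pod anti‑affinity, and preferred rules may be violated when the scheduler has no capacity to honor them, which matches the failure mode described above. A minimal sketch of the difference (component and label names are illustrative, not the actual MK8s manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: control-plane-proxy-client   # hypothetical name for illustration
spec:
  replicas: 2
  selector:
    matchLabels:
      app: control-plane-proxy-client
  template:
    metadata:
      labels:
        app: control-plane-proxy-client
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" is best-effort: under capacity pressure the scheduler
          # may still co-locate replicas on one host, as in this incident.
          # Using requiredDuringSchedulingIgnoredDuringExecution instead would
          # leave a replica Pending rather than violate the constraint.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: control-plane-proxy-client
                topologyKey: kubernetes.io/hostname
```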

What we are doing to prevent recurrence

Immediate (completed)

  • Provisioned additional proxy capacity.
  • Relocated the affected control‑plane components onto separate proxy instances, restoring balanced load and ending the 504 errors.

Short‑term

  • Architectural redesign: Redesign registry‑to‑API connectivity so each node uses a dedicated local proxy, eliminating shared‑proxy bottlenecks. Design validated in test environments; production rollout scheduled for Q2 2026.
  • Alert‑threshold review: Adjust alerting thresholds to trigger warnings before proxy connection utilization approaches capacity. Rollout in progress, expected completion Q2 2026.
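
A warning of the kind described in the alert‑threshold review could be expressed as a Prometheus alerting rule along these lines (the metric names `proxy_open_connections` and `proxy_max_connections` and the 80% threshold are assumptions for illustration, not the production configuration):

```yaml
groups:
  - name: proxy-capacity
    rules:
      - alert: ProxyConnectionUtilizationHigh
        # Fire well before the proxy reaches its hard connection limit.
        # Metric names below are hypothetical placeholders.
        expr: proxy_open_connections / proxy_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxy {{ $labels.instance }} is above 80% of its connection limit"
```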

Mid‑term

  • Load redistribution: Deploy additional infrastructure clusters and redistribute existing registries to ensure no single cluster exceeds safe operating capacity. Automation will continuously balance load as usage grows.

Closing remarks

The outage directly impacted container‑image delivery and delayed customer deployments. We have restored full service and implemented concrete architectural and operational changes to eliminate the identified bottleneck.

Posted Apr 16, 2026 - 09:54 UTC

Resolved

We are marking this incident as resolved because no further issues were found in the setup. A Root‑Cause Analysis (RCA) will be published once the team has completed its analysis.
Posted Mar 04, 2026 - 16:39 UTC

Monitoring

The Kubernetes team has deployed a mitigation that involved rolling back a component of the Kubernetes control plane to a previous version. We are currently monitoring service recovery.
Posted Mar 04, 2026 - 12:22 UTC

Identified

Our Container Registry team has identified an issue in the underlying Kubernetes cluster serving a subset of images. The team is working on a fix.
Posted Mar 04, 2026 - 10:23 UTC

Investigating

We are currently investigating an increased error rate on the IONOS Container Registry. Customers may currently be unable to pull images.
Posted Mar 04, 2026 - 09:23 UTC