We have finished our investigation into this incident and want to share the following Root Cause Analysis:
What happened?
On January 27, 2026, a storage incident occurred in our Berlin region, leading to I/O performance degradation. This impacted the availability of management and customer-facing services.
How did this happen? (Technical Root Cause)
The incident was triggered during a scheduled maintenance window involving updates to parts of our supporting infrastructure.
Initialization Failure: Following a node reboot, several storage daemons failed to reconnect due to a configuration mismatch stemming from a previous hardware replacement.
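The report does not name the storage platform or its configuration format, so the following Python sketch is purely illustrative: a hypothetical pre-start check that compares the device identities a storage daemon expects against the devices actually present on the node, which is the class of mismatch described above. All paths and names (expected_devices.json, discovered_devices.json, preflight_check) are assumptions, not details from the incident.

```python
import json
import sys
from pathlib import Path

# Hypothetical locations; a real daemon keeps this metadata in its own format.
EXPECTED_CONFIG = Path("/etc/storage-daemon/expected_devices.json")
DISCOVERED_DEVICES = Path("/run/storage-daemon/discovered_devices.json")


def load_ids(path: Path) -> set[str]:
    """Load a set of device identifiers from a JSON list."""
    return set(json.loads(path.read_text()))


def preflight_check() -> int:
    """Refuse to start the daemon if the recorded device identities no longer
    match the hardware actually present on the node, e.g. after a replacement."""
    expected = load_ids(EXPECTED_CONFIG)
    actual = load_ids(DISCOVERED_DEVICES)
    missing = expected - actual
    unknown = actual - expected
    if missing or unknown:
        print(f"Config mismatch: missing={sorted(missing)} unknown={sorted(unknown)}")
        return 1  # non-zero exit code blocks daemon startup in the service unit
    return 0


if __name__ == "__main__":
    sys.exit(preflight_check())
```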
Operational Error: During the troubleshooting process, a manual command was executed to clear stale storage entries. This command inadvertently removed active storage components from other nodes while the cluster was already in a vulnerable, degraded state.
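As an illustration of the kind of guardrail such a cleanup step could have, here is a minimal Python sketch that only ever selects entries explicitly marked stale and refuses to run automatically while the cluster is degraded. The data model (StorageEntry, safe_cleanup, the state values) is invented for this example and does not reflect the actual tooling involved.

```python
from dataclasses import dataclass


@dataclass
class StorageEntry:
    entry_id: str
    node: str
    state: str  # e.g. "active", "stale", "unknown"


def safe_cleanup(entries: list[StorageEntry], cluster_degraded: bool,
                 force: bool = False) -> list[str]:
    """Return the IDs of entries that are safe to remove.

    Only entries explicitly reported as stale are eligible, and no removal
    happens at all while the cluster is degraded unless an operator passes
    force=True after a second review.
    """
    if cluster_degraded and not force:
        raise RuntimeError("Cluster is degraded; refusing automated cleanup.")
    return [e.entry_id for e in entries if e.state == "stale"]


# Example: active entries belonging to other nodes are never selected.
entries = [
    StorageEntry("seg-101", "node-a", "stale"),
    StorageEntry("seg-102", "node-b", "active"),
]
print(safe_cleanup(entries, cluster_degraded=False))  # ['seg-101']
```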
Performance Impact: The loss of these additional components resulted in the temporary unavailability of specific data segments. This caused I/O operations to hang for some management services relying on that storage, leading to timeouts and service interruptions. Unexpectedly, two failover instances did not behave as expected, which led to a short-term impact on the IAM (identity and access management) system. Full functionality was restored once engineers recovered the missing storage components and initiated a cluster-wide data rebalancing.
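To show how a dependent service can avoid an open-ended hang when underlying data segments become unavailable, here is a small Python sketch that bounds a blocking storage read with a timeout. The read_segment function is a stand-in that only simulates a stuck request, and the two-second limit is arbitrary; neither comes from the affected services.

```python
import concurrent.futures
import time


def read_segment(segment_id: str) -> bytes:
    """Stand-in for a blocking read against the shared storage backend.
    Here it simulates a hung I/O request by sleeping far too long."""
    time.sleep(10)
    return b"segment-data"


def read_with_timeout(segment_id: str, timeout_s: float = 2.0) -> bytes | None:
    """Bound a blocking storage read so a dependent management service
    times out and degrades gracefully instead of hanging indefinitely."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(read_segment, segment_id)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Fail fast with an explicit result the caller (or a failover
        # instance) can act on, rather than an open-ended I/O hang.
        return None
    finally:
        # Do not block on the stuck worker thread.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # Prints None after ~2 s; the simulated read finishes in the
    # background a few seconds later before the process exits.
    print(read_with_timeout("seg-101"))
```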
At no point were systems that handle or store customer data directly involved in this incident. However, our IAM services were affected by the incident's performance impact, resulting in a customer-facing outage of the DCD login.
What are we doing to prevent this from happening again?
To increase the resilience of our internal supporting infrastructure and minimize the impact of human error, we are reviewing and strengthening the following measures: