We have finished our investigation into this incident and want to share the following Root Cause Analysis:
What happened?
On January 27, 2026, a storage incident occurred in our Berlin region, leading to I/O performance degradation. This impacted the availability of management and customer-facing services.
How did this happen? (Technical Root Cause)
The incident was triggered during a scheduled maintenance window involving updates to parts of our supporting infrastructure.
Initialization Failure: Following a node reboot, several storage daemons failed to reconnect due to a configuration mismatch stemming from a previous hardware replacement.
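The report does not name the storage platform or its configuration format, so the following Python sketch is purely illustrative: a hypothetical pre-start check that compares the device identities a storage daemon expects against the devices actually present on the node, which is the class of mismatch described above. All paths and names (expected_devices.json, discovered_devices.json, preflight_check) are assumptions, not details from the incident.

```python
import json
import sys
from pathlib import Path

# Hypothetical locations; a real daemon keeps this metadata in its own format.
EXPECTED_CONFIG = Path("/etc/storage-daemon/expected_devices.json")
DISCOVERED_DEVICES = Path("/run/storage-daemon/discovered_devices.json")


def load_ids(path: Path) -> set[str]:
    """Load a set of device identifiers from a JSON list."""
    return set(json.loads(path.read_text()))


def preflight_check() -> int:
    """Refuse to start the daemon if the recorded device identities no longer
    match the hardware actually present on the node, e.g. after a replacement."""
    expected = load_ids(EXPECTED_CONFIG)
    actual = load_ids(DISCOVERED_DEVICES)
    missing = expected - actual
    unknown = actual - expected
    if missing or unknown:
        print(f"Config mismatch: missing={sorted(missing)} unknown={sorted(unknown)}")
        return 1  # non-zero exit code blocks daemon startup in the service unit
    return 0


if __name__ == "__main__":
    sys.exit(preflight_check())
```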
Operational Error: During the troubleshooting process, a manual command was executed to clear stale storage entries. This command inadvertently removed active storage components from other nodes while the cluster was already in a vulnerable, degraded state.
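As an illustration of the kind of guardrail such a cleanup step could have, here is a minimal Python sketch that only ever selects entries explicitly marked stale and refuses to run automatically while the cluster is degraded. The data model (StorageEntry, safe_cleanup, the state values) is invented for this example and does not reflect the actual tooling involved.

```python
from dataclasses import dataclass


@dataclass
class StorageEntry:
    entry_id: str
    node: str
    state: str  # e.g. "active", "stale", "unknown"


def safe_cleanup(entries: list[StorageEntry], cluster_degraded: bool,
                 force: bool = False) -> list[str]:
    """Return the IDs of entries that are safe to remove.

    Only entries explicitly reported as stale are eligible, and no removal
    happens at all while the cluster is degraded unless an operator passes
    force=True after a second review.
    """
    if cluster_degraded and not force:
        raise RuntimeError("Cluster is degraded; refusing automated cleanup.")
    return [e.entry_id for e in entries if e.state == "stale"]


# Example: active entries belonging to other nodes are never selected.
entries = [
    StorageEntry("seg-101", "node-a", "stale"),
    StorageEntry("seg-102", "node-b", "active"),
]
print(safe_cleanup(entries, cluster_degraded=False))  # ['seg-101']
```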
Performance Impact: The loss of these additional components resulted in the temporary unavailability of specific data segments. This caused I/O operations to hang for some management services relying on that storage, leading to timeouts and service interruptions. Unexpectedly, two failover instances did not behave as expected, which led to a short-term impact on the IAM (identity and access management) system. Full functionality was restored once engineers recovered the missing storage components and initiated a cluster-wide data rebalancing.
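To show how a dependent service can avoid an open-ended hang when underlying data segments become unavailable, here is a small Python sketch that bounds a blocking storage read with a timeout. The read_segment function is a stand-in that only simulates a stuck request, and the two-second limit is arbitrary; neither comes from the affected services.

```python
import concurrent.futures
import time


def read_segment(segment_id: str) -> bytes:
    """Stand-in for a blocking read against the shared storage backend.
    Here it simulates a hung I/O request by sleeping far too long."""
    time.sleep(10)
    return b"segment-data"


def read_with_timeout(segment_id: str, timeout_s: float = 2.0) -> bytes | None:
    """Bound a blocking storage read so a dependent management service
    times out and degrades gracefully instead of hanging indefinitely."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(read_segment, segment_id)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Fail fast with an explicit result the caller (or a failover
        # instance) can act on, rather than an open-ended I/O hang.
        return None
    finally:
        # Do not block on the stuck worker thread.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # Prints None after ~2 s; the simulated read finishes in the
    # background a few seconds later before the process exits.
    print(read_with_timeout("seg-101"))
```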
At no point were systems that handle or store customer data directly involved in this incident. However, our IAM services were affected by the incident's performance impact, resulting in a customer-facing outage of the DCD login.
What are we doing to prevent this from happening again?
To increase the resilience of our internal supporting infrastructure and minimize the impact of human error, we are reviewing and strengthening the following measures: