What happened?
On 15 January 2026, we experienced a network incident that caused a loss of connectivity to key management devices in Berlin and disrupted both Private Cloud management and production traffic.
How did this happen? (Technical Root Cause)
The incident occurred during a planned maintenance window intended to introduce a new VLAN to the management devices.
Human error during command execution led to unexpected behavior: instead of appending the new VLAN to the trunk's allowed list, the command replaced the existing VLANs on the trunk between the core and the management devices. Consequently, only the new VLAN remained active.
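For illustration: on many switch platforms, the command that sets a trunk's allowed-VLAN list has both a replace form and an append form, and the replace form silently discards the existing list. A minimal sketch of the two semantics, with hypothetical VLAN IDs rather than the production values:

```python
# Sketch of replace-vs-append trunk semantics; all VLAN IDs are hypothetical.

existing_vlans = {100, 110, 120}  # VLANs already allowed on the trunk
new_vlan = 200                    # VLAN the maintenance intended to add

# Intended operation: append, preserving the existing allowed list.
appended = existing_vlans | {new_vlan}
assert appended == {100, 110, 120, 200}

# Executed operation: replace, overwriting the allowed list entirely.
overwritten = {new_vlan}
assert overwritten == {200}  # only the new VLAN remained active, as in the incident
```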
The removal of these VLANs led to the following cascading outages:
- Management Loss: Loss of access to multiple management interfaces in the Berlin region.
- Storage Disruption: The affected management devices hosted interfaces for internal storage clusters, so when the associated VLANs became unreachable, the storage clusters were disconnected. Restarting the affected services during the final stages of recovery also caused a brief secondary outage.
- Data Integrity Protection: This disconnection forced several VMs and Edge Nodes into read-only mode to prevent data corruption.
- Routing Issues: Because the NSX Edge Nodes (which bridge the virtual networks with the physical environment) were impacted, external routing was lost, cutting connectivity between the virtual machines and the internet.
What are we doing to prevent this from happening again?
To improve network resilience and prevent a recurrence, we have identified the following action items:
- Architecture Review: We will conduct a comprehensive audit of the management device architecture to ensure better segmentation. (Target: Q3 2026)
- VLAN Configuration Mitigation: We will evaluate and implement safeguards against unintended VLAN changes, such as stricter configuration guardrails or automation scripts; a sketch of one possible safeguard follows this list. (Target: Q1 2026)
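As an illustration of the second item, the sketch below shows one possible shape for such a safeguard: it diffs a proposed allowed-VLAN list against the running configuration and blocks any change that would silently remove VLANs. The function name, values, and workflow here are assumptions for this sketch, not our final implementation.

```python
# Minimal sketch of a pre-change VLAN safeguard; names and values are
# hypothetical placeholders, not our actual tooling or configuration.

def check_trunk_change(current_vlans: set, proposed_vlans: set) -> None:
    """Refuse a trunk update that would silently drop VLANs."""
    dropped = current_vlans - proposed_vlans
    if dropped:
        raise RuntimeError(
            f"Proposed change would remove VLANs {sorted(dropped)}; "
            "use an explicit removal workflow instead."
        )

# The January incident pattern would be caught before deployment:
current = {100, 110, 120}   # illustrative allowed list on the core trunk
proposed = {200}            # 'replace' form that keeps only the new VLAN

try:
    check_trunk_change(current, proposed)
except RuntimeError as exc:
    print(f"Change blocked: {exc}")

# The intended 'append' form passes the check.
check_trunk_change(current, current | {200})
```

Requiring removals to go through a separate, explicit workflow means an operator cannot reproduce the replace-for-append mistake with a single command.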