What happened?
During a planned infrastructure maintenance window (https://status.ionos.cloud/incidents/rh8l4xmdpdk8), which included a network control plane upgrade and a provisioning software update, two distinct issues led to service disruption for a number of Virtual Machines (VMs) across multiple compute nodes.
- Issue 1 (Configuration Error): A misconfiguration on one of our compute nodes caused the network control plane service to fail, resulting in a loss of connectivity for the VMs running on that node.
- Issue 2 (Software Bug): A bug in our provisioning software surfaced during the rollout, causing firewall configuration changes to fail on two other nodes and leaving some customer network configurations in a stuck state.
How was this possible (technical root cause)?
This particular maintenance had already been rolled out in multiple datacenters without issues. In TXL, the following circumstances led to the incident:
- Configuration Management Error: A software package installation performed prior to the maintenance resulted in an unexpected update to a dependency on a specific compute node. This prevented the new version of our core networking software from starting correctly after its deployment (a simplified sketch of how such dependency drift can be detected is shown after this list).
- Provisioning Software Defect: An update to our provisioning system introduced a software defect that created inconsistent VNIC entries in the SDN database. This prevented the necessary firewall configuration changes from completing successfully and left the network configuration of the affected virtual private data centers (VDCs) in an inconsistent state.
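To make this failure mode more concrete, the following is a minimal, hypothetical sketch of the kind of pre-maintenance check that can catch such dependency drift before an upgrade proceeds. The package names, pinned versions and tooling shown here are placeholders for illustration only and do not reflect our actual stack.

```python
#!/usr/bin/env python3
"""Hypothetical pre-maintenance check: detect dependency drift on a node.

This is a simplified sketch, not our actual tooling. It assumes a pinned
manifest (package name -> expected version) and uses `dpkg-query` to read
the versions that are actually installed, flagging any mismatch before a
control-plane upgrade is allowed to proceed on that node.
"""
import subprocess
import sys

# Placeholder pinned versions the new networking software was tested against.
PINNED = {
    "openvswitch-switch": "3.3.0-1",
    "libssl3": "3.0.13-1",
}

def installed_version(package: str) -> str | None:
    """Return the installed version of a Debian package, or None if absent."""
    result = subprocess.run(
        ["dpkg-query", "--showformat=${Version}", "--show", package],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None

def main() -> int:
    drift = []
    for package, expected in PINNED.items():
        actual = installed_version(package)
        if actual != expected:
            drift.append(f"{package}: expected {expected}, found {actual}")
    if drift:
        print("Dependency drift detected; aborting maintenance on this node:")
        print("\n".join(drift))
        return 1
    print("All pinned dependencies match; node is clear for the upgrade.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```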
What will we do to prevent this from happening again?
We are committed to preventing a recurrence through the following measures, which will ensure that we can react better to unexpected configurations during maintenance work:
- Enhanced Rollout Procedures: The issue was only spotted with a delay after the maintenance had been completed. We are further improving our update processes by adding more comprehensive pre- and post-maintenance checks that actively confirm the desired state once maintenance is finished and prevent confusion between expected and unexpected alerts for the alerting and monitoring teams during and after maintenance; a simplified sketch of such a post-maintenance check is shown after this list. (DONE)
- Resource Alignment: We will ensure that specialized network development engineers are on standby during all future core network control plane upgrades to address issues immediately. (DONE)
- Change Synchronization: We will coordinate changes so that updates to the core network and to the virtual private data center provisioning systems are no longer rolled out on the same day, minimizing the risk of complex failure patterns. (DONE)
- Software Fix: Our development team is actively working on a hotfix for the identified bug in the provisioning software. (QA Ongoing, Release expected this week)
- Incident Classification and Communication: We are simplifying the classification of incidents and automating the creation of initial Status Page updates to ensure that high-impact incidents are classified as such early and are published to stakeholders more quickly. (Within Q1 2026)
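As an illustration of the post-maintenance checks described above, the following is a minimal, hypothetical sketch of a desired-state verification step. Node names, the service unit and the stuck-VDC query are placeholders and do not reflect our internal tooling; the intent is to confirm, right after a maintenance window closes, that the network control plane is running on every touched node and that no VDC is left in a stuck provisioning state.

```python
#!/usr/bin/env python3
"""Hypothetical post-maintenance verification: confirm the desired state.

A simplified sketch only; everything named here is a placeholder. It checks
that the network control-plane service is active on every node touched by
the maintenance and that no VDC remains stuck in a transitional provisioning
state, so unexpected alerts can be separated from expected maintenance noise.
"""
import subprocess

# Placeholder inventory of the compute nodes touched by the maintenance.
NODES = ["compute-01.example.internal", "compute-02.example.internal"]
CONTROL_PLANE_SERVICE = "network-control-plane.service"  # placeholder name

def service_active(node: str, service: str) -> bool:
    """Check over SSH whether a systemd unit is active on a node."""
    result = subprocess.run(
        ["ssh", node, "systemctl", "is-active", "--quiet", service],
        capture_output=True,
    )
    return result.returncode == 0

def stuck_vdcs() -> list[str]:
    """Placeholder for a provisioning-state query.

    In a real check this would ask the provisioning system which VDCs have
    been in a transitional state (e.g. a pending firewall update) for longer
    than a defined threshold.
    """
    return []

def main() -> None:
    failures = [n for n in NODES if not service_active(n, CONTROL_PLANE_SERVICE)]
    stuck = stuck_vdcs()
    if failures or stuck:
        print("Post-maintenance check FAILED:")
        for node in failures:
            print(f"  control plane not running on {node}")
        for vdc in stuck:
            print(f"  VDC {vdc} stuck in a transitional provisioning state")
    else:
        print("Post-maintenance check passed: desired state confirmed.")

if __name__ == "__main__":
    main()
```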
We are currently rolling out a series of improvements to our network setup across all our datacenters. These upgrades will bring improved stability and performance to all our customers and partners. This incident occurred in the context of that work, and we understand that it has caused disruption. We believe that the measures we have now taken will make the upcoming rollouts safer while allowing us to continue executing on our improvement roadmap with both speed and diligence.