What happened?
We experienced a network incident resulting in connectivity issues between our FRA and FRE data centers. This led to a high error rate for services hosted on a segment of S3 hosts in the FRA region.
How could this happen? (Technical Root Cause)
The incident was caused by a software failure within the Border Gateway Protocol (BGP) control plane on a specific network switch. With the control plane down, Layer 3 routing on that switch stopped working, and outgoing traffic from a segment of S3 hosts was blackholed (silently dropped).
A network redundancy mechanism designed to prevent traffic blackholing in failure scenarios did not activate. The mechanism is triggered when physical links go down, and in this case the physical network connections remained operational even though the critical control plane service had failed. The issue was ultimately resolved by rebooting the affected network switch.
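To illustrate the gap described above, here is a minimal sketch in Python. It is not our actual redundancy mechanism or its configuration. The names `SwitchState`, `link_up`, `control_plane_healthy`, and both functions are hypothetical and exist only to show why a trigger that watches physical link state alone misses a control plane failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchState:
    """Hypothetical, simplified view of a switch for illustration only."""
    link_up: bool                # physical links are operational
    control_plane_healthy: bool  # routing control plane (e.g. BGP) is working


def failover_on_link_only(state: SwitchState) -> bool:
    """Trigger failover only when the physical link is down."""
    return not state.link_up


def failover_on_link_or_control_plane(state: SwitchState) -> bool:
    """Trigger failover when the link is down OR the control plane has failed."""
    return (not state.link_up) or (not state.control_plane_healthy)


# The situation during the incident: links stayed up, control plane failed.
incident = SwitchState(link_up=True, control_plane_healthy=False)

print(failover_on_link_only(incident))              # link-only trigger
print(failover_on_link_or_control_plane(incident))  # link + control plane trigger
```

In this sketch the link-only trigger stays inactive for the incident state, while a trigger that also considers control plane health would fire. How control plane health is detected in practice is outside the scope of this sketch.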
What are we doing to prevent this from happening again?
To prevent a recurrence and improve network stability, we are implementing the following measures: