Object Storage Service Restrictions in 'eu-central-3'

Incident Report for IONOS Cloud

Resolved

All performance metrics have returned to normal operating levels, and we are marking this incident as resolved. The following is the Root Cause Analysis (RCA) for the incident.

What happened:
The following events were identified as key contributing factors:

1. A RAM hardware problem in our storage system caused a slowdown in request processing. A number of processing nodes had to be purged, which left a high number of deletion markers ("tombstones") in the RocksDB instances used by Ceph. This forced Ceph to read significantly more data for every single request, increasing API latency and, in some cases, causing timeouts in connected applications. The problem was limited to the Berlin instance; other locations were not affected.
2. The problem caused by the RAM hardware failure was significantly amplified by a parallel, system-wide, and long-running maintenance procedure that required propagating changes throughout the system.
3. At the same time, excessive LIST queries, as issued by popular backup SDKs, generated a high load that could no longer be adequately distributed across the affected system (see the illustrative sketch below).

The coincidence of these three events led to a prolonged service disruption and significantly slowed down efforts to recover performance.
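
For illustration only, the following minimal Python sketch shows the kind of paginated LIST (ListObjectsV2) traffic that backup tools typically generate when enumerating a bucket. The endpoint, bucket name, and prefix are placeholders, not values taken from this incident; each returned page corresponds to one metadata read against the bucket index.

import boto3

# Placeholder endpoint and bucket for illustration only; substitute your region's
# Object Storage endpoint and your own bucket name.
s3 = boto3.client("s3", endpoint_url="https://s3.example-region.example.com")

paginator = s3.get_paginator("list_objects_v2")

# Every page returned here is one LIST request against the bucket index.
# Enumerating a large bucket therefore produces a long burst of metadata reads;
# narrowing the Prefix (or caching results client-side) reduces the number of index scans.
total = 0
for page in paginator.paginate(Bucket="example-backup-bucket", Prefix="backups/2025/"):
    total += len(page.get("Contents", []))
print(f"objects listed: {total}")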

What we did:
Our team promptly identified the first root cause (the hardware defect) and took measures to optimize our storage system's database. Unfortunately, these measures were significantly slowed down by the ongoing maintenance, during which changes had to be propagated throughout the system. Through targeted maintenance, we were able to significantly reduce the amount of data that has to be processed for each request.
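
As an illustration of this kind of targeted maintenance (a sketch only, not necessarily the exact procedure our operators followed), the following Python snippet triggers an on-demand compaction of each OSD's RocksDB store via the standard `ceph osd ls` and `ceph tell osd.<id> compact` admin commands. Compaction removes accumulated tombstones and so reduces the amount of data read per request. It assumes admin access to a Ceph cluster with the `ceph` CLI available.

import subprocess

def osd_ids():
    # `ceph osd ls` prints one OSD id per line.
    out = subprocess.run(["ceph", "osd", "ls"], capture_output=True, text=True, check=True)
    return [int(token) for token in out.stdout.split()]

def compact_osd(osd_id):
    # `ceph tell osd.<id> compact` asks that OSD to compact its RocksDB/OMAP store.
    subprocess.run(["ceph", "tell", f"osd.{osd_id}", "compact"], check=True)

for osd in osd_ids():
    print(f"compacting osd.{osd} ...")
    compact_osd(osd)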

The result:
Response times have returned to normal, and system performance is fully back to the usual standard.

What we are doing to prevent this from happening again:
- Observability Update: We have enhanced our monitoring systems to detect similar problems earlier (to be finalized by the end of Q4).
- System Update: We will update our storage system to Ceph Squid, which will further improve efficiency and monitoring capabilities (planned for Q4).
- Optimize Load Balancing: We will adjust the configuration to permanently reduce the load on the most affected parts of the system and increase scalability in similar scenarios. This is especially important for the metadata storage, which was also heavily affected in this case (planned for Q4).

We believe these actions will help prevent a similar scenario in which a hardware failure coincides with prolonged maintenance and simultaneous high load. We understand that this prolonged service degradation has affected many customers and partners. We worked hard to restore our service, and we will work hard to complete every action point derived from this incident to provide you with a reliable service.
Posted Oct 24, 2025 - 07:20 UTC

Monitoring

Overall latency has returned to expected levels, and requests are being processed normally again. We continue to monitor the situation to ensure ongoing stability.
Posted Oct 20, 2025 - 06:44 UTC

Update

We can confirm that PUT requests to customer buckets are now functioning normally, with overall latency returning to expected levels. However, latency remains high for bucket listing operations, and we continue actively investigating and addressing this issue to restore full service performance. Our monitoring and system compaction strategies remain in place to ensure stability and optimize performance as we work toward resolution.
Posted Oct 17, 2025 - 11:12 UTC

Update

We are actively monitoring disk bandwidth and latency to quickly detect and resolve any recurring issues. We have implemented a strategy to complete necessary system compactions simultaneously across all hosts to ensure optimal performance. These compactions are designed to run independently and efficiently, minimizing any potential impact on service stability.

We have identified that the root cause of past performance issues was related to how our system handled a large volume of customer data deletions.
Posted Oct 17, 2025 - 09:32 UTC

Update

We've completed initial fixes on critical parts of the system, and other parts are still being optimized. While our overall service speed has improved, we are still seeing some internal delays that we hope to address with further fine-tuning. We will continue to monitor the situation closely and provide updates.
Posted Oct 16, 2025 - 10:24 UTC

Update

We are actively working to resolve an S3 Object Storage latency issue. While customer API response times have improved, we are aiming to restore internal latency to its normal baseline. We are performing system maintenance and continuously monitoring performance to ensure a full resolution. We will keep you informed of our progress.
Posted Oct 16, 2025 - 00:54 UTC

Identified

We are working to resolve an issue affecting the performance of our S3 Object Storage. Our team has identified a potential solution which has significantly improved response times in our initial tests.

We are currently implementing this solution across our storage systems and are closely monitoring the improvements. We are dedicated to ensuring the best possible experience for our users and will keep you informed of our progress.
Posted Oct 15, 2025 - 15:53 UTC

Investigating

We are currently addressing a performance degradation within our cloud storage service, specifically impacting metadata operations. This has resulted in elevated latency, which we are actively working to mitigate.

Our engineering teams are investigating the underlying causes, with initial assessments pointing to inefficiencies in our data management layer. We are exploring multiple resolution strategies, including system optimization and potential architectural adjustments, in collaboration with external experts.

Immediate mitigation measures are being implemented, while we continue working toward a complete resolution. We are committed to restoring optimal service performance and will provide further updates as our investigation and remediation efforts advance.
Posted Oct 15, 2025 - 12:10 UTC

Monitoring

We are currently running maintenance to restore the performance of the service. You might still see degraded performance until the maintenance is fully completed.
Posted Oct 10, 2025 - 11:59 UTC

Investigating

We are currently investigating an increased number of upload and download errors affecting our Object Storage in the 'eu-central-3' region.
Posted Oct 09, 2025 - 12:38 UTC
This incident affected: Location DE/TXL (Object Storage).