The AWS Outage of October 20th 2025
What companies can learn from the year's biggest cloud outage
Race Conditions: When Systems Get in Their Own Way
On October 20th, 2025, the internet came to a standstill, or at least it felt that way. From Fortnite to Zoom, from Signal to Amazon: an outage at Amazon Web Services (AWS) crippled large parts of the internet. After hours of disruption, all services were running again by Tuesday. But what exactly happened? And what can we learn from it?
Sven Köhler, Sales Account Executive for AWS business at eggs unimedia, and Simon Bönisch, Product Owner and primary technical lead for the AWS business area, analyzed and contextualized the incident in a video call conducted in German. The conversation between the sales expert and the technical lead provides a rare insight into the complexity of modern cloud infrastructure and shows why even the best systems are not immune to outages.
The Full Conversation as Video (German Audio)
The Analysis
(Translated from German)
Sven: Simon, suddenly nothing worked anymore, from Fortnite to Zoom. An outage at AWS crippled large parts of the internet. After hours of disruptions on Monday, all services were running again by Tuesday, according to the Handelsblatt headline. What happened? Was it really a security incident, as many have written? Was AWS hacked?
Simon: No, it wasn't an attack. The rumor did spread relatively quickly, but there is now a postmortem from AWS in which they explain in detail what happened. It was a race condition in the DNS system, more precisely in an automated DNS management system. Two processes tried to make changes at the same time, and in the process the internal DNS entries were lost; the table was suddenly empty. DNS is the addressing system in network technology, and without these addresses, the servers and applications no longer knew how to communicate with each other.
Sven: Race condition – what exactly is that again? And can that happen more often?
Simon: A race condition occurs when two processes work on the same thing at the same time, in different places in the system, and get in each other's way. It's really a fault tolerance issue. In this case, it resulted in the address book suddenly being empty. AWS immediately disabled the automation and is now building in additional safeguards to prevent this from happening again. Such errors are extremely difficult to find and debug because they only occur under very specific conditions.
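To make the mechanism concrete: the following minimal Python sketch shows a race condition of the kind Simon describes. Two uncoordinated updaters rewrite a shared address table from snapshots they took earlier, and the slower one wins with a stale, empty plan. All names and data are invented for illustration; this is not how AWS's DNS automation actually works.

```python
import threading
import time

# A shared "DNS table": service name -> list of IP addresses (purely illustrative).
dns_table = {"dynamodb.example.internal": ["10.0.0.1", "10.0.0.2"]}

def updater(name, new_records, read_delay):
    snapshot = dict(dns_table)      # step 1: read the current state (no locking!)
    time.sleep(read_delay)          # ... meanwhile the other updater keeps working
    # step 2: build a full replacement plan from the (possibly stale) snapshot
    plan = {key: new_records.get(key, []) for key in snapshot}
    # step 3: write the plan back, silently overwriting any newer changes
    dns_table.clear()
    dns_table.update(plan)
    print(f"{name} applied plan: {plan}")

# Two automation processes race to rewrite the same table.
fast = threading.Thread(target=updater,
                        args=("updater-A", {"dynamodb.example.internal": ["10.0.0.3"]}, 0.1))
slow = threading.Thread(target=updater, args=("updater-B", {}, 0.3))  # B holds an empty plan
fast.start(); slow.start()
fast.join(); slow.join()

print("final table:", dns_table)    # the entry ends up empty: the addresses are lost
```

Whichever updater writes last wins, regardless of whether its data is still current. That is exactly why such bugs only surface under unlucky timing and are so hard to reproduce.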
Sven: Which services were affected?
Simon: The primary affected service was DynamoDB, a database service at AWS that is also used internally by AWS itself. That was the real problem – a cascade. After DynamoDB went down, EC2 and Lambda were also affected. These are very central services for virtualized servers and the execution of serverless code. The actual outage at DynamoDB lasted only about an hour and a half until they got the DNS problem under control. But then it took more time for all dependent services to come back online.
Sven: So this chain reaction is what made it take so long in the end?
Simon: Exactly. EC2 and Lambda internally also use a database, and in this case it was DynamoDB. When that went down, queues filled up and caches overflowed. Then you usually have to intervene manually. You may have read that, for example, messengers like Signal went down. Users then constantly try to log in, which leads to a flood of requests. Manual intervention is needed to block the requests and slowly bring the system back online.
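A common client-side countermeasure against exactly this kind of retry flood, which the conversation does not go into, is exponential backoff with jitter: each failed attempt waits longer, and a random component keeps thousands of clients from retrying in lockstep. A minimal sketch with invented names, not an AWS-specific API:

```python
import random
import time

def call_with_backoff(request, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry a failing call with exponential backoff and full jitter.

    `request` is any zero-argument callable that raises on failure.
    The defaults are illustrative, not official recommendations.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise                       # give up after the last attempt
            # Double the waiting window each attempt, then pick a random
            # point in it so clients don't all hammer the service at once.
            window = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))

# Usage: wrap the login or API call that keeps failing during an outage, e.g.
# call_with_backoff(lambda: client.login(user, password))
```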
Sven: Why does DynamoDB have such a critical status that when it fails, everything goes down?
Simon: DynamoDB is used by management services. EC2 has to store a few things internally in the database, as do Lambda and the API Gateway. Various AWS services use it, and those are in turn used by other services. The system complexity in these huge cloud platforms is simply extremely high. If one part fails and everything depends on everything else, the failure cascades through the entire system. It's the same with smaller applications; we know that from our own development work. But the larger and more complex the system, the more difficult such problems become.
Sven: That can happen, but probably shouldn't, right? There has been quite a bit of ridicule directed at AWS. How does it look with other cloud providers? Can something like that happen there too?
Simon: It can happen but shouldn't, that's true. With such highly complex systems, though, it is practically impossible to prevent something like this from ever happening. In this case it wasn't even a human error, but this race condition: two systems working in parallel and then competing with each other. These are errors that are extremely difficult to find.
What's important is that the respective providers have clean processes in place to respond. And in that regard, AWS, with its long experience and the massive system it has been operating for many years, is actually among the leaders. They responded very quickly, communicated clearly, handled the whole thing transparently, and got it under control within a few hours. If you look at the technical details, that's quite impressive.
For us as a partner, such outages are also interesting. There are these post-event summaries from AWS in which they explain in great detail what happened. It's always instructive when you get insight into the architecture and see how such a huge system is operated.
Sven: What can we personally learn from this? Are there things we should do differently now?
Simon: What you can learn from this is that architectural fundamentals like high availability and multi-region deployment exist for very good reasons. Something like this can always happen again. In this case, mainly the US-East-1 region was affected. If you had operated a web application exclusively in this region, the problem would have been even bigger. Of course, the outage also cascaded globally, because US-East-1 is a very central region for Amazon's internal services. This shows once again that topics like resilience should be considered in application planning and operations.
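What multi-region thinking can look like at the application level is easiest to show with a small sketch. The example below assumes a DynamoDB Global Table named `orders` that is replicated to two regions; the table name, the key schema, and the region choice are illustrative assumptions, not a recommendation from the conversation:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Illustrative only: for this to work, "orders" would have to be a DynamoDB
# Global Table replicated to both regions listed here.
REGIONS = ["us-east-1", "eu-central-1"]
TIMEOUTS = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1})

clients = [boto3.client("dynamodb", region_name=r, config=TIMEOUTS) for r in REGIONS]

def get_order(order_id: str) -> dict:
    """Read an item from the first region that answers, falling back to the next."""
    last_error = None
    for client in clients:
        try:
            response = client.get_item(
                TableName="orders",
                Key={"order_id": {"S": order_id}},
            )
            return response.get("Item", {})
        except (BotoCoreError, ClientError) as err:
            last_error = err          # remember the failure and try the next region
    raise RuntimeError("all regions failed") from last_error
```

In real deployments this failover usually lives in DNS, load balancing, or the data layer rather than in every single call, but the principle is the same: no single region should be a hard dependency.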
Sven: Are there any criticisms of AWS?
Simon: The general criticism is, of course, the strong dependence on such systems in the modern world. That is a problem. Personally, however, I don't believe you can escape it by switching to European cloud providers. Sooner or later they will run into the same problems, because they too will be operating very complex systems.
You can take this into account in product design. I read something curious: luxury mattresses that automatically regulate their temperature failed because they needed a constant internet connection. Some of them got stuck at 40 degrees Celsius. Or beds that adjust themselves to prevent snoring: the cloud went down and the bed couldn't be lowered anymore. Those are the kinds of things you should pay attention to in product design. Beyond that, though, I believe you can't really avoid such outages.
Sven: Now I know why I slept so restlessly that night! Thank you for the analysis, Simon.
Conclusion: Resilience Instead of Perfection
The AWS outage of October 20, 2025 shows: Even the best systems are not infallible. A race condition, a rare technical error, was enough to cripple services worldwide. The most important insight: It's not about completely avoiding outages, but about limiting them and responding to them quickly. Multi-region architectures, diversified dependencies, and disciplined incident readiness are the way forward.
For us at eggs unimedia, it's clear: With our long-standing AWS expertise and our project experience, we are well positioned to guide customers through such situations and develop resilient cloud architectures.