HN Reader

More than DNS: Learnings from the 14 hour AWS outage

174

Aside from not dogfooding, what would have reduced the impact? Because "don't have a bug" is .. well it's the difference between desire and reality.

Not dogfooding is the software and hardware equivalent of the electricity network "black start" -you never want to be there, but somewhere in the system you need a honda petrol generator, which is enough to excite the electromagnets on a bigger generator, which you spin up to start the turbine spinning until the steam takes up load and the real generator is able to get volts onto the wire.

Pumped Hydro is inside the machine. It's often held out as the black-start mechanism because it's gravity, there's less to go wrong, but if we are in 'line up the holes in the cheese grater' space, you can always have 'want of a nail' issues with any mechanism. The honda generator can have a hole in the petrol tank, the turbine at the pumped hydro can be out for maintenance.

8 days agoby ggm

Good to see an analysis emphasizing the metastable failure mode in EC2, rather than getting bogged down by the DNS/Dynamo issue. The Dynamo issue, from their timeline, looks like it got fixed relatively quickly, unlike EC2, which needed a fairly elaborate SCRAM and recovery process that took many hours to execute.

A faster, better-tested "restart all the droplet managers from a known reasonable state" process is probably more important than finding all the Dynamo race conditions.

6 days agoby tptacek

> What is surprising is that a classic Time-of-check-time-of-use (TOCTOU) bug was latent in their system until Monday. Call me naive, but I thought AWS, an organization running a service that averages 100 million RPS, would have flushed out TOCTOU bugs from their critical services.

Yeah, right. I'm surprised how anyone involved with software engineering can be surprised by this. I would argue that there many, if not infinitely many, similar bugs out there. It's just that the right conditions for them to show up haven't been met yet.

6 days agoby bboozzoo

Legitimate question on if the talent exodus from AWS is starting to take its toll. I’m talking about all the senior long-turned folks jumping ship for greener pastures, not the layoffs this week which mostly didn’t touch AWS (folks saying that will happen in future rounds).

The fact that there was an outage is not unexpected… it happens… but all the stumbling and length to get things under control was concerning.

6 days agoby JCM9

> In normal operation of EC2, the DWFM maintains a large number (~10^6) of active leases against physical servers and a very small number (~10^2) of broken leases, the latter of which the manager is actively attempting to reestablish.

It sounds like fixing broken leases takes a lot more work than renewing functional leases? Funnily enough, there is already an AWS blog post about specifically preventing these types of issues: https://aws.amazon.com/builders-library/reliability-and-cons...

6 days agoby __turbobrew__

> Most obviously, RCA has an infinite regress problem

Root cause analysis is just like any other tool. Failure to precisely define the nature of the problem is what usually turns RCA into a wild goose chase. Consider the following problem statements:

"The system feels yucky to use. I don't like it >:("

"POST /reports?option=A is slow around 3pm on Tuesdays"

One of these is more likely to provide a useful RCA that proceeds and halts in a reasonable way.

"AWS went down"

Is not a good starting point for a useful RCA session. "AWS" and "down" being the most volatile terms in this context. What parts of AWS? To what extent were they down? Is the outage intrinsic to each service or due to external factors like DNS?

"EC2 instances in region X became inaccessible to public internet users between Y & Z"

This is the kind of grain I would be doing my PPTX along if I was working at AWS. You can determine that there was a common thread after the fact. Put it in your conclusion / next-steps slide. Starting hyper-specific means that you are less likely to get distracted and can arrive at a good answer much faster. Aggregating the conclusions of many reports, you could then prioritize the strategy for preventing this in the future.

6 days agoby bob1029

I can't imagine a more uncomfortable place to try and troubleshoot all this than in a hotel lobby surrounded by a dozen coworkers.

6 days agoby SketchySeaBeast

It looks from the public writeup that the thing programming the DNS servers didn't acquire a lease on the server to prevent concurrent access to the same record set. I'd love to see the internal details on that COE.

I think when there is an extended outage it exposes the shortcuts. If you have 100 systems, and one or two can't start fast from zero, and they're required to get back to running smoothly, well you're going to have a longer outage. How would you deal with that, you'd uniformly across your teams subject them to start from zero testing. I suspect though that many teams are staring down a scaling bottleneck, or at least were for much of Amazon's life and so scaling issues (how do we handle 10x usage growth in the next year and half, which are the soft spots that will break) trump cold start testing. Then you get a cold start event with that last one being 5 years ago and 1 or 2 out of your 100 teams falls over and it takes multiple hours all hands on deck to get it to start.

6 days agoby fizlebit

14 hours drops them to at least 99.8% uptime, over a whole year. Where will that be advertised?

5 days agoby vrighter

Man, AWS is at it again. That big outage on Oct 20 was rough, and now (Oct 29) it looks like things are shaky again. SMH.

us-east-1 feels like a single point of failure for half the internet. People really need to look into multi-region, maybe even use AI like Zast.ai for intelligent failover so we're not all dead in the water when N. Virginia sneezes.

6 days agoby zastai0day

As someone building a SaaS product launching soon, outages like this are incredibly instructive—though also terrifying.

The "cold start" analogy resonates. As an indie builder, I'm constantly making decisions that optimize for "ship fast" vs "what happens when this breaks." The reality is: you can't architect for every failure mode when you're a team of 1-2 people.

But here's what's fascinating about this analysis: the recommendation for control theory and circuit breakers isn't just for AWS-scale systems. Even small products benefit from graceful degradation patterns. The difference is—at AWS scale, a metastable failure can cascade for 14 hours. At indie scale, you just... lose customers and never know why.

The talent exodus point is also concerning. If even AWS—with their resources and institutional knowledge—can struggle when senior folks leave, what chance do startups have? This is why I'm increasingly convinced that boring, well-documented tech stacks matter more than cutting-edge ones. When you're solo, you need to be able to debug at 2am without digging through undocumented microservice chains.

Question: For those running prod systems, how do you balance "dogfooding" (running on your own infra) vs "use boring, reliable stuff" (like AWS for your control plane)? Is there a middle ground?

6 days agoby sanskarix