
AWS Outage — Oct 19–20, 2025 (US-EAST-1): What Broke, Why It Cascaded, and How to Architect Around It



TL;DR


1) What actually failed (plain English)


2) Timeline (all times PDT, per AWS)


3) Who went dark (examples, not exhaustive)

Large consumer platforms and SaaS tools saw outages or degraded behavior as their US-EAST-1 dependencies failed: Fortnite/Epic, Snapchat, Perplexity, Airtable, Canva, Zapier, Alexa/Ring, and more. Amazon posted that all AWS services had returned to normal operations by roughly 6 PM ET, with some message backlogs still processing afterward.


4) Why one DDB/DNS bug crippled so much


5) Scope and significance


6) What to change in your architecture (practical, not platitudes)

A) Regional blast-radius control
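
One common shape of blast-radius control is cell-based routing: pin each customer deterministically to a cell so that a failure in one cell (or one Region) only affects that slice of customers. A minimal sketch, assuming hypothetical cell endpoints:

```python
# Toy cell router: deterministically pin each customer to one "cell" so a bad cell
# (or a bad Region) only affects that slice of customers. Endpoints are placeholders.
import hashlib

CELLS = [
    "https://cell-1.us-east-1.example.com",
    "https://cell-2.us-east-1.example.com",
    "https://cell-3.us-west-2.example.com",
]

def cell_for(customer_id: str) -> str:
    # Stable hash keeps a customer on the same cell across requests and deploys.
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]
```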

B) Survive discovery/DNS weirdness
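
When service discovery or DNS returns empty or wrong answers, clients that remember the last known-good resolution degrade far more gracefully than clients that fail immediately. A minimal "serve stale on failure" sketch, assuming a hypothetical `resolve_with_stale_fallback` helper and an in-process cache (real deployments would more likely do this at the resolver or sidecar layer):

```python
# Illustrative "serve stale on failure" resolver (hypothetical helper and cache).
# Idea: keep the last good answer per hostname and reuse it when live resolution
# fails or comes back empty, e.g. during a DNS automation incident.
import socket
import time

_LAST_GOOD: dict[str, tuple[list[str], float]] = {}   # host -> (addresses, timestamp)
STALE_MAX_AGE = 6 * 3600                               # how long to trust a stale answer

def resolve_with_stale_fallback(host: str, port: int = 443) -> list[str]:
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        addrs = sorted({info[4][0] for info in infos})
        if addrs:                                      # fresh, non-empty answer: cache it
            _LAST_GOOD[host] = (addrs, time.time())
            return addrs
    except socket.gaierror:
        pass                                           # fall through to the stale cache

    cached = _LAST_GOOD.get(host)
    if cached and time.time() - cached[1] < STALE_MAX_AGE:
        return cached[0]                               # serve stale rather than fail hard
    raise RuntimeError(f"no usable address for {host}")
```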

C) Decouple “can launch” from “can serve”
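
One concrete (and hedged) reading of this is static stability: pre-provision enough running capacity that serving traffic never depends on the control plane successfully launching new instances mid-incident. The sketch below assumes an Auto Scaling group named `web-asg` and placeholder numbers:

```python
# Hedged sketch: pin minimum capacity at (or near) peak so the data plane can keep
# serving even if new launches are failing. Group name and sizes are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

def pin_static_capacity(group: str = "web-asg", peak_instances: int = 12) -> None:
    # Serving should not depend on a successful scale-up during an incident.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group,
        MinSize=peak_instances,
        DesiredCapacity=peak_instances,
    )
```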

D) Token & identity resilience
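
A hedged sketch of one token-resilience pattern: refresh credentials well before expiry and tolerate a short grace window when the identity provider is briefly unreachable, so a transient auth outage does not instantly fail every request. This only helps if the downstream validator honors a similar grace window; the `TokenCache` class and its parameters are illustrative, not a real SDK API.

```python
# Hypothetical token cache with early refresh and a short grace window.
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class CachedToken:
    value: str
    expires_at: float          # unix seconds

class TokenCache:
    def __init__(self, fetch: Callable[[], CachedToken],
                 refresh_ahead: float = 300.0, grace: float = 120.0):
        self._fetch = fetch            # e.g. a wrapper around your IdP/STS call (assumption)
        self._refresh_ahead = refresh_ahead
        self._grace = grace
        self._token: CachedToken | None = None

    def get(self) -> str:
        now = time.time()
        needs_refresh = (self._token is None or
                         now > self._token.expires_at - self._refresh_ahead)
        if needs_refresh:
            try:
                self._token = self._fetch()
            except Exception:
                # IdP unreachable: tolerate a recently expired token for `grace` seconds.
                if self._token and now < self._token.expires_at + self._grace:
                    return self._token.value
                raise
        return self._token.value
```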

E) Data layers with escape hatches
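
As an illustrative escape hatch for the read path only: keep a pre-built client for a replica Region and fall back to it when the home Region misbehaves. This assumes the data is already replicated (for example via DynamoDB global tables) and that slightly stale reads are acceptable; the table, key, and Region names are placeholders.

```python
# Hypothetical read-path escape hatch: try the home Region, then a replica Region.
# Assumes the table is replicated (e.g. a DynamoDB global table); names are placeholders.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

_CFG = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1})
PRIMARY = boto3.resource("dynamodb", region_name="us-east-1", config=_CFG)
FALLBACK = boto3.resource("dynamodb", region_name="us-west-2", config=_CFG)

def get_profile(user_id: str) -> dict | None:
    for region_client in (PRIMARY, FALLBACK):
        try:
            resp = region_client.Table("user_profiles").get_item(Key={"user_id": user_id})
            return resp.get("Item")        # may be slightly stale when served by the replica
        except (ClientError, BotoCoreError):
            continue                       # try the next Region
    raise RuntimeError("all Regions failed for read of user_profiles")
```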

F) Queueing and async
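
A toy sketch of the "buffer locally, drain later" idea for producers: if the queue (or the DNS in front of it) is unreachable, append events to a local journal instead of dropping them, then replay once things recover. The queue URL and spill path are placeholders, and a real implementation needs decisions this toy omits (fsync, dedup, ordering, backpressure).

```python
# Toy "spill to local disk when the queue is unreachable" producer (placeholder
# queue URL and spill path).
import json
import os
import boto3
from botocore.exceptions import BotoCoreError, ClientError

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"   # placeholder
SPILL_PATH = "/var/spool/myapp/orders.jsonl"                            # placeholder
sqs = boto3.client("sqs", region_name="us-east-1")

def publish(event: dict) -> None:
    body = json.dumps(event)
    try:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
    except (ClientError, BotoCoreError):
        # Queue (or its DNS) is unhappy: append to a local journal and move on.
        os.makedirs(os.path.dirname(SPILL_PATH), exist_ok=True)
        with open(SPILL_PATH, "a", encoding="utf-8") as f:
            f.write(body + "\n")

def drain_spill() -> None:
    """Replay spilled events once the queue is healthy again (run from a worker/cron)."""
    if not os.path.exists(SPILL_PATH):
        return
    remaining = []
    with open(SPILL_PATH, encoding="utf-8") as f:
        for line in f:
            try:
                sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=line.strip())
            except (ClientError, BotoCoreError):
                remaining.append(line)     # keep anything that still won't send
    with open(SPILL_PATH, "w", encoding="utf-8") as f:
        f.writelines(remaining)
```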

G) Runbooks that match reality


7) Quick MTT(R) checklist for this outage class


8) FAQ (fast)

Was this a cyberattack? No public evidence of that; AWS published a post-event summary pointing to an internal DNS automation race condition.
Why did so many unrelated apps fail? They weren’t unrelated—shared regional control planes and shared DNS/service discovery made the blast radius large.
Would multi-cloud have saved us? Maybe—but multi-Region within AWS with strong isolation and tested failover usually delivers 80% of the resilience with far less complexity.
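
To make that last answer slightly more concrete, here is a hedged, client-side sketch of what "multi-Region with tested failover" can look like: an ordered list of Regional endpoints tried in turn with short timeouts. The endpoints are placeholders, and production setups more often implement this with Route 53 failover records or Global Accelerator than with client loops, but the failover order and short timeouts are the same idea.

```python
# Client-side view of Regional failover (placeholder endpoints).
import urllib.error
import urllib.request

REGIONAL_ENDPOINTS = [
    "https://api.us-east-1.example.com",   # home Region (placeholder)
    "https://api.us-west-2.example.com",   # standby Region (placeholder)
]

def fetch_status(path: str = "/healthz", timeout: float = 2.0) -> str:
    last_err: Exception | None = None
    for base in REGIONAL_ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                if resp.status == 200:
                    return resp.read().decode("utf-8")
        except (urllib.error.URLError, TimeoutError) as exc:
            last_err = exc                 # Region unreachable or unhealthy: try the next one
    raise RuntimeError(f"all Regional endpoints failed: {last_err}")
```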


Sources / further reading