AWS Outage — Oct 19–20, 2025 (US-EAST-1)
TL;DR
- Root cause: A latent race condition in DynamoDB’s automated DNS management left dynamodb.us-east-1.amazonaws.com with an empty DNS record. That broke endpoint resolution until operators intervened manually, and the recovery then exposed fragility in EC2 control-plane and NLB health-check flows.
- Impact windows (PDT):
- DynamoDB endpoint failures: Oct 19 11:48 PM → Oct 20 2:40 AM
- EC2 new instance launches / networking delays: 2:25 AM → 1:50 PM
- NLB connection errors due to health-check behavior: 5:30 AM → 2:09 PM
- Blast radius: 100+ AWS services affected; popular apps (Snapchat, Fortnite, Perplexity, Alexa, etc.) saw outages or severe degradation.
- Duration: The worst of it spanned ~14–15 hours; Amazon said services were back to normal by Monday afternoon (Oct 20), with some backlogs still draining afterward.
- Why it matters: Dependency chains (DNS → DDB → EC2 DWFM/Network Manager → NLB health checks → Lambda/ECS/EKS/STS/Redshift…) can turn a single regional control-plane fault into a platform-wide brownout. Build for graceful degradation and regional independence, not just “five nines” on paper.
1) What actually failed (plain English)
- DynamoDB DNS automation glitched. Its DNS planner/enactor system applied an older plan while a cleanup deleted the newer one, leaving the regional public endpoint with no IPs. Automation then couldn’t repair the state. Manual ops restored DNS within ~3 hours.
- The EC2 control plane couldn’t keep up. EC2’s DropletWorkflow Manager (DWFM) lost leases while DDB was unreachable; when DDB came back, DWFM entered congestive collapse and needed controlled restarts plus throttling. New instance launches and Network Manager propagation lagged for hours.
- NLB health checks flapped. New instances were brought into service before their network state had propagated, so health checks alternated fail/pass, triggering AZ DNS failovers and connection errors until failover behavior was adjusted.
2) Timeline (all times PDT, per AWS)
- 11:48 PM Oct 19 — DDB DNS breaks; clients (and AWS internal services) can’t resolve endpoint.
- 12:38 AM — AWS identifies DDB DNS as the source.
- ~2:25 AM — DNS info restored; global tables catch up by 2:32 AM.
- 2:25 AM → 1:50 PM — EC2 launches impaired; DWFM/Network Manager backlogs; full EC2 recovery 1:50 PM.
- 5:30 AM → 2:09 PM — NLB connection errors; engineers disable automatic AZ failover at 9:36 AM, re-enable 2:09 PM.
- Other services:
- Lambda: 11:51 PM → 2:15 PM (with throttling/backlog draining).
- ECS/EKS/Fargate: 11:45 PM → 2:20 PM.
- STS/IAM sign-in: errors around 11:51 PM → early morning; secondary STS blip 8:31–9:59 AM.
- Redshift: core ops back 2:21 AM, but some clusters did not fully recover until 4:05 AM Oct 21 because they depended on EC2 instance-replacement flows.
3) Who went dark (examples, not exhaustive)
Large consumer platforms and SaaS tools saw outages or degraded behavior as their US-EAST-1 dependencies failed: Fortnite/Epic, Snapchat, Perplexity, Airtable, Canva, Zapier, Alexa/Ring, and more. Amazon posted that all AWS services returned to normal operations by ~6 PM ET, with some message backlogs still processing afterward.
4) Why one DDB/DNS bug crippled so much
- Service discovery is a shared choke point. DynamoDB’s DNS automation maintains hundreds of thousands of DNS records for load balancers and endpoints in a Region; break the planner/enactor pair and the service’s public endpoint can vanish entirely.
- Control-plane dependencies stack up. EC2’s server lease manager (DWFM) and network propagation rely on state backed by DDB or adjacent systems. If that state is unavailable, new capacity is hard to launch or wire up.
- Health-check automation can amplify pain. NLB interpreted “not yet networked” as “unhealthy,” tripping cross-AZ failovers and draining good capacity.
- Downstream services inherit the blast radius. Lambda, ECS/EKS, STS, Redshift, and Connect all saw errors or timeouts because their control paths or scaling loops cross those layers.
5) Scope and significance
- This was the biggest internet disruption since the 2024 CrowdStrike incident, with impacts from banks to social apps. Newsrooms tracked global ripple effects as outages rolled and then eased through the day.
- Reports counted ~113 AWS services affected at peak. Consumer-facing impact was visible via major apps and millions of Downdetector reports.
6) What to change in your architecture (practical, not platitudes)
A) Regional blast-radius control
- Go active/active across Regions for critical paths; don’t centralize control planes or service discovery in US-EAST-1.
- Treat DDB/Route 53 as Tier-0 dependencies; build fallback resolvers and per-Region endpoints into clients (sketch below).
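A minimal sketch of the per-Region wiring idea, using boto3 with a manual fallback read. The Region order, the "orders" table name, and the key shape are illustrative assumptions, not anything prescribed by AWS:
```python
# Per-Region DynamoDB clients with a manual fallback read.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then failover

# Short timeouts + few retries so a dead endpoint fails fast instead of hanging.
_cfg = Config(connect_timeout=2, read_timeout=2,
              retries={"max_attempts": 2, "mode": "standard"})
_clients = {r: boto3.client("dynamodb", region_name=r, config=_cfg) for r in REGIONS}

def get_item_with_fallback(key: dict):
    """Try each Region in order; return the first successful read."""
    last_err = None
    for region in REGIONS:
        try:
            resp = _clients[region].get_item(TableName="orders", Key=key)
            return resp.get("Item")
        except (EndpointConnectionError, ClientError) as err:
            last_err = err  # endpoint unreachable or erroring; try the next Region
    raise last_err
```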
B) Survive discovery/DNS weirdness
- Short TTLs on your service endpoints + client-side caching with jittered backoff.
- Put circuit breakers around SDK resolvers; if DNS fails, fail over to an alternate Region or a read-only mode (sketch below).
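One way to sketch that circuit breaker: probe DNS for the regional endpoint and back off when it stops resolving. The 30-second back-off is an arbitrary placeholder:
```python
# Time-based circuit breaker around DNS resolution of the DDB endpoint.
import socket
import time

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"
_breaker_open_until = 0.0

def endpoint_resolvable(host: str = ENDPOINT) -> bool:
    """True if DNS currently resolves the endpoint to at least one address."""
    global _breaker_open_until
    if time.monotonic() < _breaker_open_until:
        return False  # breaker open: skip lookups, stay in degraded mode
    try:
        socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        _breaker_open_until = time.monotonic() + 30  # back off before re-probing
        return False

# Callers branch on this, e.g.:
#   if not endpoint_resolvable():
#       serve_from_cache_or_alternate_region()   # hypothetical degraded path
```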
C) Decouple “can launch” from “can serve”
- Don’t auto-scale blindly. Require network-ready signals (ENI attached, route propagated) before adding instances to LBs.
- Delay or gate health-check promotions during control-plane recovery; add capacity progressively (sketch below).
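A rough sketch of the "network-ready before serving" gate. It uses ENI attachment status as a proxy signal (it does not verify route propagation, which is what actually lagged in this incident); all resource names and timings are assumptions:
```python
# Register an instance with its target group only once its ENIs report in-use.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
elbv2 = boto3.client("elbv2", region_name="us-east-1")

def register_when_network_ready(instance_id: str, target_group_arn: str,
                                timeout_s: int = 300) -> bool:
    """Only put the instance behind the LB once its ENIs report in-use."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        enis = ec2.describe_network_interfaces(
            Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
        )["NetworkInterfaces"]
        if enis and all(eni["Status"] == "in-use" for eni in enis):
            elbv2.register_targets(TargetGroupArn=target_group_arn,
                                   Targets=[{"Id": instance_id}])
            return True
        time.sleep(10)  # network state not ready yet; wait and re-check
    return False  # never became network-ready; keep it out of the pool
```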
D) Token & identity resilience
- Tune STS token lifetimes to bridge outages, with graceful expiry handling in clients (see the sketch after this list).
- Allow degraded console/admin flows outside the impacted Region.
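A minimal sketch of the token-lifetime idea: request longer-lived STS credentials from a Regional endpoint outside us-east-1 and refresh them well before expiry. The role ARN, session duration, and Region are placeholders, and the role's max-session-duration setting caps what DurationSeconds can actually be:
```python
# Longer-lived STS credentials, refreshed before expiry, from a Regional endpoint.
from datetime import datetime, timedelta, timezone
import boto3

sts = boto3.client("sts", region_name="us-west-2",
                   endpoint_url="https://sts.us-west-2.amazonaws.com")

_creds = None

def get_credentials(role_arn: str = "arn:aws:iam::123456789012:role/app-role"):
    """Return cached credentials, refreshing when <15 minutes of life remain."""
    global _creds
    now = datetime.now(timezone.utc)
    if _creds is None or _creds["Expiration"] - now < timedelta(minutes=15):
        resp = sts.assume_role(RoleArn=role_arn,
                               RoleSessionName="outage-resilient-session",
                               DurationSeconds=4 * 3600)  # 4h, if the role allows it
        _creds = resp["Credentials"]
    return _creds
```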
E) Data layers with escape hatches
- For DDB-heavy apps: consider Global Tables with regional write isolation and fast failover, or a read-through cache (ElastiCache/Memcached) for high-QPS reads when the DDB control plane is impaired (sketch below).
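A minimal read-through-cache sketch with stale-on-error behavior. An in-process dict stands in for ElastiCache/Memcached, and the table name, key shape, and TTL are illustrative:
```python
# Read-through cache in front of DynamoDB; serves stale data when DDB errors.
import time
import boto3
from botocore.exceptions import BotoCoreError, ClientError

ddb = boto3.client("dynamodb", region_name="us-east-1")
_cache: dict = {}   # pk -> (fetched_at, item)
TTL_S = 60

def read_through(pk: str):
    now = time.monotonic()
    hit = _cache.get(pk)
    if hit and now - hit[0] < TTL_S:
        return hit[1]                                   # fresh cache hit
    try:
        item = ddb.get_item(TableName="profiles",
                            Key={"pk": {"S": pk}}).get("Item")
        _cache[pk] = (now, item)
        return item
    except (BotoCoreError, ClientError):
        return hit[1] if hit else None                  # DDB unhealthy: serve stale
```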
F) Queueing and async
- Expect SQS/Kinesis backlogs; design idempotent consumers, retry budgets, and catch-up modes (sketch below).
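A minimal idempotent-consumer sketch for draining an SQS backlog. The queue URL is a placeholder, and a production dedupe store would be durable (e.g., a conditional write to a DDB table) rather than an in-memory set:
```python
# Idempotent SQS consumer for backlog drain.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"
_processed = set()  # stand-in for a durable dedupe store

def drain_once(handle_message) -> int:
    """Pull one batch, skip duplicates, delete only after successful handling."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=5)
    handled = 0
    for msg in resp.get("Messages", []):
        if msg["MessageId"] not in _processed:
            handle_message(msg["Body"])   # the handler itself must be retry-safe
            _processed.add(msg["MessageId"])
            handled += 1
        # Redeliveries after visibility timeout are expected; dedupe covers them.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    return handled
```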
G) Runbooks that match reality
- Write an “us-east-1 brownout” playbook: DNS flaps, EC2 launches stalled, NLB flapping.
- Rehearse traffic shedding, feature flags, read-only modes, and manual scaling in the other Region.
7) Quick MTT(R) checklist for this outage class
- [ ] DNS failover tested (Route53 health checks, weighted/latency policies, client fallback list).
- [ ] Per-Region DDB endpoints wired; Global Tables tested under partial replication.
- [ ] LB promotion delay: require network-ready before target enters pool.
- [ ] Auto-scaling guards: pause scale-out during control-plane incidents.
- [ ] Token lifetimes + offline admin access validated.
- [ ] Backlog drain scripts for SQS/Lambda/ECS after throttling.
- [ ] Game day: simulate “DDB endpoint not resolvable” and “EC2 launches blocked” (test sketch below).
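A minimal game-day sketch for that last item: inject "DDB endpoint not resolvable" in a test by patching DNS resolution, then drive your app's degraded path. pytest is assumed as the runner; the hostname match is the whole trick:
```python
# Fault injection: make lookups for the DDB endpoint fail, as in the outage.
import socket
from unittest import mock

import pytest

def _broken_dns(host, *args, **kwargs):
    # Simulate the empty DNS record: lookups for the DDB endpoint fail outright.
    if "dynamodb.us-east-1" in host:
        raise socket.gaierror("simulated empty DNS record")
    raise NotImplementedError("only the DDB host is exercised in this sketch")

def test_ddb_dns_failure_is_injectable():
    with mock.patch("socket.getaddrinfo", side_effect=_broken_dns):
        with pytest.raises(socket.gaierror):
            socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
        # Real game day: call your service's read path here instead and assert
        # it returns cached/read-only responses rather than raising.
```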
8) FAQ (fast)
Was this a cyberattack? No public evidence of that; AWS published a post-event summary pointing to an internal DNS automation race condition.
Why did so many unrelated apps fail? They weren’t unrelated—shared regional control planes and shared DNS/service discovery made the blast radius large.
Would multi-cloud have saved us? Maybe—but multi-Region within AWS with strong isolation and tested failover usually delivers 80% of the resilience with far less complexity.
Sources / further reading
- AWS Post-Event Summary (Oct 19–20, 2025): deep technical root cause, service-by-service impact, corrective actions.
- Reuters: real-time coverage and resolution timing.
- The Verge: roundup of affected consumer apps and timeline markers.
- Al Jazeera / AP: scope stats and downstream impact snapshots.