
Facebook — The Six Hours That Vanished

Facebook, Instagram, WhatsApp, and Messenger went offline globally for approximately six hours after a BGP routing update accidentally withdrew Facebook's DNS routes from the internet, affecting an estimated 3.5 billion users.

Root Cause

During routine backbone capacity maintenance, a command accidentally withdrew the BGP routes that told the internet how to reach Facebook's DNS servers. With DNS unreachable, all services vanished. Engineers couldn't fix it remotely because their remote access tools also ran on the same network.

Aftermath

Estimated $100+ million in lost revenue. Internal badge systems, which ran on the same network, also failed — engineers couldn't enter buildings to reach the servers. Led to industry-wide review of self-referential infrastructure dependencies.

The Incident

On October 4, 2021, at approximately 15:39 UTC, Facebook, Instagram, WhatsApp, and Messenger simultaneously vanished from the internet. Not slow. Not degraded. Gone. DNS queries for facebook.com returned no results. The outage lasted approximately six hours and affected an estimated 3.5 billion users worldwide.

The Root Cause

During routine maintenance, engineers issued a command intended to assess the available capacity of Facebook's global backbone network. The command's scope was wrong: instead of evaluating a test subset of the backbone, it acted on the entire backbone, disconnecting it. An audit tool existed to catch exactly this kind of mistake, but a bug in the tool allowed the command to proceed.
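
To make that failure mode concrete, here is a minimal, hypothetical sketch in Python of how a scope error and an audit bug can combine. The router names, the empty-filter bug, and the audit check are illustrative assumptions; Facebook has not published its actual tooling.

```python
# Hypothetical illustration only: Facebook's real tooling is not public.
# A maintenance command is meant to assess capacity on a small test
# subset of backbone routers, but its scope filter silently matches
# everything, and the audit check that should stop it has a bug.

from dataclasses import dataclass

@dataclass
class Router:
    name: str
    region: str

BACKBONE = [
    Router("bb-ash-1", "us-east"),
    Router("bb-lhr-1", "eu-west"),
    Router("bb-sin-1", "ap-south"),
]

def select_targets(routers: list[Router], region_filter: str) -> list[Router]:
    # Intended: restrict the assessment to one region.
    # Scope error: an empty filter is a substring of every region name,
    # so "" selects the entire backbone.
    return [r for r in routers if region_filter in r.region]

def audit_allows(targets: list[Router], fleet: list[Router]) -> bool:
    # The audit is supposed to reject commands that touch the whole
    # fleet, but the buggy comparison approves any non-empty target list.
    return len(targets) > 0  # BUG: should be len(targets) < len(fleet)

targets = select_targets(BACKBONE, "")   # scope error selects all 3 routers
if audit_allows(targets, BACKBONE):      # audit bug lets it through
    print(f"Command will run on {len(targets)}/{len(BACKBONE)} backbone routers")
```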

With its backbone connections severed, Facebook's DNS servers, designed to stop advertising themselves when they cannot reach the data centers, withdrew the BGP (Border Gateway Protocol) routes that announced Facebook's nameservers to the internet. Without these routes, no DNS resolver on earth could find facebook.com. Every Facebook service, including Instagram, WhatsApp, Messenger, Workplace, and Oculus, immediately became unreachable.
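
From the outside, the failure was observable with nothing more than a standard DNS lookup. The sketch below uses only Python's standard library; against the live site it succeeds today, but during the outage the error branch is what every client hit, because recursive resolvers could no longer reach the authoritative nameservers.

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Resolve a hostname the way an application would, via getaddrinfo."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror as err:
        # During the outage this branch fired for facebook.com: with the
        # BGP routes to the authoritative nameservers withdrawn, resolvers
        # had no one to ask and resolution failed outright.
        print(f"{hostname}: resolution failed ({err})")
        return []

print(resolve("facebook.com"))
```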

The Compounding Failure

Engineers couldn't fix the problem remotely because every remote access tool they used — their VPN, their remote management consoles, their internal communication systems — also ran on Facebook's network. The fix required physical access to the data center routers.

But the badge access systems at Facebook's data centers also ran on the internal network. Engineers arriving at data centers couldn't badge in. Physical security protocols required verification through systems that were down. Teams had to be physically dispatched with credentials and manual overrides.

Why It Matters

Facebook's outage is the canonical example of self-referential infrastructure dependency: when your fix path depends on the thing that's broken, you have no fix path. The remote access tools ran on Facebook's network. The badge systems ran on Facebook's network. The communication tools ran on Facebook's network. A single routing change made 3.5 billion users' services vanish because every recovery mechanism shared the same single point of failure.
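
One way to act on that lesson is to audit recovery tooling for transitive dependencies on the systems it is meant to recover. The sketch below runs a simple reachability check over a small dependency graph; the graph and component names are illustrative assumptions, not Facebook's actual topology.

```python
# Illustrative dependency graph: edges point from a tool to what it needs.
DEPENDS_ON = {
    "remote_access_vpn": {"backbone_network"},
    "badge_system": {"backbone_network"},
    "internal_chat": {"backbone_network"},
    "oob_serial_console": {"cellular_link"},  # a genuinely independent path
}

def depends_transitively(component: str, target: str, graph: dict) -> bool:
    """Return True if `component` reaches `target` through dependency edges."""
    stack, seen = [component], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

FAILED = "backbone_network"
for tool in DEPENDS_ON:
    if depends_transitively(tool, FAILED, DEPENDS_ON):
        print(f"WARNING: recovery path '{tool}' depends on '{FAILED}'")
```

Run ahead of an incident, a check like this flags every tool whose recovery path collapses along with the primary network, which is exactly the trap Facebook's engineers found themselves in.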

Techniques
BGP withdrawal · configuration error · self-referential dependency