A software update introduced a bug where a switch recovering from a brief outage would send a message that caused neighboring switches to restart. The cascade propagated because every switch ran identical software. The bug was in the recovery logic — the very code designed to restore service after a failure.
75 million blocked calls. Approximately $60 million in lost revenue. The incident demonstrated that monoculture — every node running the same software — turns a single bug into a network-wide failure.
The Incident
On January 15, 1990, AT&T's long-distance telephone network suffered a cascading failure that blocked approximately 75 million phone calls over nine hours. Roughly one in every two calls failed to connect. It was the worst telephone outage in AT&T's history.
The Root Cause
A software update had been deployed to AT&T's 4ESS switching systems — the backbone of the long-distance network. The update contained a bug in the recovery logic: when a switch experienced a brief outage and came back online, it would send a signal to neighboring switches indicating it was back in service. A flaw in the new code caused the receiving switch to briefly take itself offline in response. When that switch came back online, it sent its own recovery signal — causing its neighbors to briefly go offline. And so on.
The bug was a misplaced break statement within a recovery routine, a defect confined to about three lines of code. The cascade propagated across the entire network because every 4ESS switch in the country was running the identical software version with the identical bug.
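The cascade described above can be sketched as a small graph traversal. This is an illustrative toy model, not AT&T's software: the ring topology, the function names `simulate_cascade` and `ring`, and the idea of marking some switches as running a different software version are all assumptions made for the example. It also counts each switch as resetting only once, whereas the real switches kept knocking each other offline repeatedly.

```python
from collections import deque


def simulate_cascade(neighbors, buggy, start):
    """Return the set of switches that reset at least once.

    neighbors: dict mapping switch id -> list of adjacent switch ids
    buggy:     set of switch ids running the flawed software version
    start:     the switch that suffers the initial brief outage
    """
    reset = {start}
    queue = deque([start])
    while queue:
        recovering = queue.popleft()
        # The recovering switch announces that it is back in service.
        for peer in neighbors[recovering]:
            # Only peers running the flawed version reset on that announcement.
            if peer in buggy and peer not in reset:
                reset.add(peer)
                queue.append(peer)  # it will recover and announce in turn
    return reset


def ring(n):
    """Hypothetical topology: n switches, each linked to two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


network = ring(8)

# Monoculture: every switch runs the flawed version.
monoculture = simulate_cascade(network, buggy=set(range(8)), start=0)

# Mixed deployment: switches 2 and 6 run a different version (an assumption).
mixed = simulate_cascade(network, buggy=set(range(8)) - {2, 6}, start=0)

print(len(monoculture), len(mixed))  # prints "8 3"
```

In the monoculture run, the reset reaches all eight switches. In the mixed run, the two switches on a different version act as firebreaks, and only the starting switch and its two immediate neighbors are affected.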
Why It Matters
The bug was in the recovery code — the code designed to restore service after a failure. The mechanism meant to heal the network was the mechanism that brought it down. And the cascade was total because every switch was identical. Monoculture — running the same code everywhere — means a single bug has a single blast radius: everything.