The ceramic packages used to enclose Intel DRAM chips contained trace amounts of naturally occurring uranium and thorium — radioactive impurities in the alumina (aluminum oxide) ceramic compound used for hermetic sealing. As those isotopes decayed, they emitted alpha particles: high-energy helium nuclei that traveled the few microns from the package lid to the silicon die. When an alpha particle struck a DRAM storage cell, it deposited enough charge to flip the stored bit — a 0 became a 1, or a 1 became a 0 — without leaving any permanent trace. The physical hardware was undamaged. The stored value was simply wrong. Tim May, an Intel engineer, identified the mechanism in 1978 while investigating inexplicable memory errors. He published his findings with Murray Woods in a 1979 IEEE paper that established the field of soft error analysis. May and Woods received the IEEE W.R.G. Baker Award in 1981 for the work.
The discovery forced a fundamental rethinking of memory reliability. The industry had previously modeled memory failure as a binary — a cell worked or it didn't. Soft errors introduced a third state: the cell works, and the value is wrong. Mitigation strategies deployed across the industry over the following decade included: purification of ceramic packaging materials to reduce uranium and thorium content; polyimide and other passivation coatings applied directly over the die to block alpha particles; packaging material transitions toward plastic and other low-radioactivity substrates; and the development of Error-Correcting Code (ECC) memory — a mathematical layer that detects and corrects single-bit errors in real time. ECC memory is now standard in all servers, enterprise storage, and any system where silent data corruption is unacceptable. The threat did not go away with cleaner packaging: as DRAM cells continued to shrink, cosmic ray neutrons — previously too weak to affect large cells — became a dominant soft error source, requiring ongoing ECC deployment even in chips built from pristine materials.
The Incident
In the mid-1970s, Intel and other semiconductor manufacturers were fielding mysterious reports from customers: their systems were producing random, unrepeatable errors. A calculation would return the wrong answer. A stored value would be different when read back. The programs were correct. The hardware passed diagnostics. The errors disappeared on reboot. They came back later, in a different place, at a different time.
The standard explanations — cosmic interference, noise on the power supply, a marginal timing spec — didn't hold up. Tim May, an Intel engineer, began investigating systematically in 1977. His approach was methodical: he needed to rule out every known failure mechanism before accepting that something unknown was happening.
What he found was that the chip packages themselves were radioactive.
The Root Cause
The ceramic hermetic packaging used for early DRAM chips was manufactured from alumina — aluminum oxide — a material that provided excellent thermal and electrical properties but was sourced from natural clay. Natural clay contains trace amounts of uranium-238 and thorium-232, isotopes that have been decaying since Earth formed. These quantities were small — parts per million — but in a package sitting microns above a sensitive silicon die, small was enough.
As uranium and thorium decayed, they emitted alpha particles: helium nuclei ejected at several MeV of energy. Alpha particles are stopped by a few centimeters of air or a sheet of paper. They are stopped by almost nothing at the scale of a chip package. A particle emitted from the underside of the ceramic lid had a clear path to the silicon surface.
When an alpha particle struck a DRAM storage cell, it ionized a trail through the silicon, generating electron-hole pairs along its path. If the ionization track passed close enough to a storage node, the collected charge could exceed the cell's critical charge, Q_crit: the minimum charge disturbance needed to upset the stored state. The cell would flip. A 0 would read as a 1. A 1 would read as a 0.
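The threshold behavior can be reduced to a one-line sketch. The critical-charge figure below is an assumption chosen purely for illustration; real values depend on cell geometry and process node.

```python
# Toy model of the upset mechanism: a stored bit flips when the charge a
# particle strike deposits near the storage node exceeds the cell's
# critical charge. Q_CRIT_FC is an assumed figure, not device data.
Q_CRIT_FC = 50.0  # critical charge, femtocoulombs (illustrative)

def read_after_strike(stored_bit, deposited_fc):
    # XOR with 1 inverts the bit only when the strike exceeds Q_crit.
    return stored_bit ^ int(deposited_fc > Q_CRIT_FC)

read_after_strike(0, 80.0)  # above threshold: the bit flips to 1
read_after_strike(1, 10.0)  # below threshold: the bit reads back as 1
```

A strike above threshold inverts the value; below it, the cell reads back correctly, which is exactly why the errors left no physical trace.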
The hardware was physically undamaged. The package had done exactly what it was designed to do. The radioactive decay was operating precisely as physics required. Nothing had broken. A bit had simply changed value.
Why It Was So Hard to Find
Soft errors are invisible to every diagnostic designed for hard failures. The cell passes continuity tests. The chip passes burn-in. The system passes POST. The diagnostic completes without error. Then, hours later, under real workload, a particle strikes. A value changes. If the value is in a register that gets checked, the system might catch it. If the value is in buffered data that gets written out and read back later, the corruption propagates silently. If the value is in code — which was rare but not impossible for self-modifying or JIT-compiled systems — behavior becomes undefined.
The intermittency made software teams suspect themselves. If a bug reproduced inconsistently, the natural assumption was a race condition, an uninitialized variable, a timing dependency. Blaming the chip packaging required knowing that chip packaging could be the source — which nobody did until May's investigation.
The Threat Model No One Had
The semiconductor industry had built its reliability models around two assumptions: a transistor either works or it doesn't, and external electromagnetic interference comes from clearly identifiable sources (power lines, RF equipment). Neither model had a slot for "the package is slowly emitting ionizing radiation and flipping your bits."
This is not a software failure. It is not a design failure in the conventional sense — the engineers who specified the ceramic packages were not thinking about radioactive decay because radioactive decay had never been on the threat model. It was a gap between physics and engineering models that persisted unnoticed because the symptoms looked like everything else.
May and Woods's 1979 IEEE paper, "Alpha-Particle-Induced Soft Errors in Dynamic Memories," established the quantitative framework: the soft error rate (SER), the critical charge (Q_crit), and the relationship between cell size and vulnerability. As cells shrank over subsequent decades, Q_crit dropped — meaning a smaller particle strike could flip a bit — and soft error rates climbed even as packaging materials were purified, because by then the dominant source had shifted to cosmic ray neutrons from the atmosphere.
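The exponential sensitivity of soft error rate to critical charge can be sketched with a simplified empirical model of the Hazucha-Svensson form. The flux, area, and collection-efficiency numbers below are assumptions for illustration, not measured device data.

```python
import math

# Simplified empirical model: SER is proportional to F * A * exp(-Q_crit / Q_s),
# where F is particle flux, A is the sensitive area, Q_crit is the critical
# charge, and Q_s is the charge-collection efficiency. All numbers here are
# illustrative assumptions, not device measurements.
def soft_error_rate(q_crit_fc, q_s_fc=2.0, flux=1.0, area=1.0):
    return flux * area * math.exp(-q_crit_fc / q_s_fc)

# Shrinking a cell so that Q_crit halves raises the rate exponentially:
ratio = soft_error_rate(10.0) / soft_error_rate(20.0)
print(round(ratio))  # e**5, roughly a 148x increase
```

The exponential term is why scaling hurt: a linear reduction in Q_crit buys an exponential increase in upset rate, so cleaner packaging alone could never close the gap.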
The Permanent Fix That Isn't
There is no physical fix that eliminates soft errors. Every DRAM cell in production today can be flipped by a sufficiently energetic particle. The industry's response was architectural: Error-Correcting Code (ECC) memory adds redundant bits to each memory word and uses Hamming codes or similar schemes to detect and correct single-bit errors in real time. ECC turns a physics problem into an information theory problem — and information theory, unlike radioactive decay, can be engineered around.
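The detect-and-correct mechanism can be illustrated with a minimal Hamming(7,4) sketch. Production ECC DIMMs use wider SEC-DED codes over 64-bit words (typically 8 check bits per word); this toy version shows only the principle.

```python
# Minimal sketch of single-bit error correction with a Hamming(7,4) code.
# Four data bits are encoded into seven; any single flipped bit can be
# located and corrected by recomputing the parity checks (the "syndrome").

def encode(d):
    # Codeword positions (1-indexed): p1 p2 d1 p4 d2 d3 d4
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def correct(c):
    # Recompute each parity check; a nonzero syndrome is the 1-indexed
    # position of the flipped bit.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]  # recover the data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                       # simulate an alpha-particle strike
assert correct(word) == [1, 0, 1, 1]
```

The redundancy is what makes the physics survivable: the particle still flips the bit, but the syndrome pinpoints and reverses the flip before the corrupted value reaches software.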
ECC is now mandatory in any system where data integrity matters: servers, storage arrays, scientific instruments, aerospace, automotive safety systems. Consumer DRAM, for cost reasons, mostly still ships without it, which means that the phone in your pocket and the laptop you're reading this on are, right now, silently accumulating soft errors at a rate determined by your altitude, your proximity to radiation sources, and the laws of nuclear physics.
Why It Matters
Tim May's discovery established a principle that has become foundational to hardware reliability: the physical substrate is part of the threat model. Every abstraction layer — from transistor to logic gate to register to memory to software — sits on physical matter that obeys physics, and physics does not respect software correctness proofs.
The Intel DRAM incident is also the origin of the term "soft error" itself. The language the industry uses to discuss memory reliability — SER, Q_crit, FIT rate, ECC coverage — traces directly to this discovery. Every server memory spec sheet that lists an ECC error rate is describing the quantified, managed consequence of the fact that some of your memory bits are being flipped by particles arriving from the sun and from the decay products of ancient uranium buried in your chip packaging.
The code was correct. The hardware was working. Physics was the bug.