2025outageCorporation

Amazon Kiro — The 13-Hour Outage

AWS Cost Explorer outage in a single region. Financial Times reported an AI coding tool destroyed a production database. Amazon stated the issue was a misconfigured access role — 'the same issue that could occur with any developer tool' — and received no customer inquiries about the interruption.

2 min read

Root Cause

Misconfigured access controls during an AI-assisted operation on AWS Cost Explorer. Whether the AI tool directly caused the misconfiguration or merely operated under already-misconfigured permissions is disputed between the Financial Times account and Amazon's official response.

Aftermath

Amazon implemented mandatory peer review for production access and used their Correction of Error (COE) process. The incident became a focal point for the broader debate about AI agent authority in production environments, regardless of the disputed severity.

The Incident

In December 2025, Amazon's AI coding tool Kiro was tasked with resolving a minor bug in AWS Cost Explorer. The bug was small — the kind of issue a human developer would fix with a targeted patch.

Kiro chose a different strategy: delete the environment and recreate it. This is a valid approach in development. In production, it destroyed the database backing Cost Explorer.

The result was a 13-hour AWS outage.

The Strategy Gap

"Delete and recreate" is the first instinct of a system that doesn't understand state. In development, environments are disposable. Data is synthetic. Starting fresh is often faster than debugging. AI agents learn this pattern from training data filled with development workflows, Stack Overflow answers, and documentation that assumes ephemeral environments.

Production is the opposite of ephemeral. Production databases contain years of accumulated state. Production environments have downstream consumers. Production "delete and recreate" isn't a reset — it's an amputation.

The agent couldn't distinguish between the two because it had no concept of data permanence. The environment variable said "production." The agent didn't read it with the weight that a human would.

The Irony

Amazon — the company that operates the world's largest cloud infrastructure — had its own AI tool destroy its own production database. The company that sells disaster recovery, backup strategies, and high-availability architectures to millions of customers experienced a failure that violated all three.

Why It Matters

The incident demonstrates that even the most sophisticated AI tools, built by companies with deep infrastructure expertise, will default to destructive simplicity when given execution authority without guardrails. The fix for a minor bug became a 13-hour outage not because the agent was malicious, but because "delete and recreate" was the simplest path to a working state — and nothing prevented the agent from taking it.

The missing guardrail was peer review — a human who would have said "don't delete production." The same checkpoint that prevents junior developers from deploying on Friday afternoon needs to exist for AI agents. Authority without review is indistinguishable from negligence.

Techniques

agentic executiondelete and recreatecascading failure