2023 · bug · Corporation

Amazon Prime Video — The Per-Frame State Machine

Infrastructure cost roughly ten times higher than necessary, judging by the 90% reduction the rewrite achieved. Published as a 'success story' rather than a post-mortem.

Root Cause

The video quality monitoring service processed every frame through individual AWS Step Functions state transitions. Step Functions is designed for orchestration, not high-frequency data processing.

Aftermath

Team moved to a monolith and reduced costs by 90%. Published a blog post. ThePrimeagen's reaction video went viral, highlighting the irony of Amazon not understanding its own products.

The Incident

Amazon Prime Video's audio/video quality monitoring service was built on AWS Step Functions and Lambda. The service checked every video stream for quality defects — dropped frames, corruption, block artifacts.

The architecture processed every frame of every stream through individual Step Functions state transitions, and Step Functions charges per state transition. At 24 to 30 frames per second, that is roughly 86,000 to 108,000 transitions per stream per hour, which adds up to millions per stream per day.
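To make that scale concrete, here is a back-of-the-envelope sketch of how per-frame transitions accumulate. The price constant is an assumption (approximately the list price for Step Functions Standard Workflows, USD 0.025 per 1,000 transitions; check the current AWS pricing page), and the function names and the one-transition-per-frame simplification are illustrative rather than taken from the Prime Video system.

```python
# Assumed list price for Step Functions Standard Workflows, in USD per state
# transition. This is an illustrative assumption; verify current pricing.
ASSUMED_PRICE_PER_TRANSITION = 0.025 / 1000


def transitions_per_stream(fps: int, hours: float, transitions_per_frame: int = 1) -> int:
    """Count state transitions for one stream, assuming a fixed number per frame."""
    seconds = hours * 3600
    return int(fps * seconds * transitions_per_frame)


def orchestration_cost(transitions: int, price: float = ASSUMED_PRICE_PER_TRANSITION) -> float:
    """Orchestration charge only; excludes Lambda, S3, and data transfer."""
    return transitions * price


if __name__ == "__main__":
    for fps in (24, 30):
        per_day = transitions_per_stream(fps, hours=24)
        print(f"{fps} fps: {per_day:,} transitions/day, "
              f"${orchestration_cost(per_day):,.2f}/day per stream")
```

Under these assumptions, a single 24 fps stream monitored for a full day produces about 2.07 million transitions, or roughly $52 in orchestration charges alone, before any Lambda or S3 cost and before multiplying by the number of streams.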

The Architecture

```
Video stream → Step Function → Lambda (per frame) → S3 → Lambda → SNS
```

Each frame triggered a state machine transition. Each transition cost money. Each Lambda invocation could incur a cold start. Step Functions was designed for orchestration workflows (approve this order, route this ticket), not high-frequency data processing.

The "Fix"

The team collapsed the distributed architecture into a single monolithic process. Same logic. Same quality checks. 90% cost reduction.
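A minimal sketch of what "same checks, one process" can look like is shown here. It is not the Prime Video implementation: the `Frame` type, the two detector functions, and `monitor` are hypothetical names invented for illustration. The point is only that frames move between steps as in-memory objects instead of crossing a service boundary on every frame.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Sequence, Tuple


@dataclass
class Frame:
    index: int
    pixels: bytes  # stand-in for decoded image data


# A detector is a plain function: it returns a defect name, or None if the frame is fine.
Detector = Callable[[Frame], Optional[str]]


def detect_empty_frame(frame: Frame) -> Optional[str]:
    """Hypothetical check: flag frames with no pixel data as dropped."""
    return "dropped_frame" if not frame.pixels else None


def detect_black_frame(frame: Frame) -> Optional[str]:
    """Hypothetical check: flag non-empty frames whose bytes are all zero."""
    return "black_frame" if frame.pixels and not any(frame.pixels) else None


def monitor(frames: Iterable[Frame], detectors: Sequence[Detector]) -> Iterator[Tuple[int, str]]:
    """Run every detector on every frame inside one process.

    Each check is an ordinary function call, so there is no per-frame state
    transition, function invocation, or object-store round trip.
    """
    for frame in frames:
        for detect in detectors:
            defect = detect(frame)
            if defect is not None:
                yield frame.index, defect


if __name__ == "__main__":
    stream = [Frame(0, b"\x10\x20"), Frame(1, b""), Frame(2, b"\x00\x00")]
    for index, defect in monitor(stream, [detect_empty_frame, detect_black_frame]):
        print(index, defect)
```

The per-frame work is unchanged in this shape; what disappears is the orchestration and transport overhead that was being paid on every frame.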

They published this as a blog post titled "Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%." The framing was: we discovered monoliths can be better than microservices for some workloads.

The Reaction

ThePrimeagen's response captured what the blog post didn't say: this wasn't a discovery about microservices vs monoliths. This was Amazon — the company that built and sells AWS — not understanding which of their own products was appropriate for this workload. Step Functions are for state machines with infrequent transitions, not per-frame video processing.

Why It Matters

The "microservices for everything" best practice of 2015 was the design assumption that created this disaster. The architecture made sense on a whiteboard. It made sense in a design review. It didn't make sense when applied to a data flow that generates dozens of events per second per stream, continuously, across every stream being monitored. Right-size your architecture to your data flow.
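One way to sanity-check an orchestration design against its data flow is to count how many orchestrated units it creates per hour at different granularities. The sketch below does that arithmetic; the chunk sizes are arbitrary illustrations, not choices the Prime Video team is known to have made, and `transitions_per_hour` is a hypothetical helper.

```python
def transitions_per_hour(fps: int, frames_per_unit: int) -> int:
    """State transitions per stream-hour when one transition covers `frames_per_unit` frames."""
    frames = fps * 3600
    # Ceiling division: a partial final unit still needs its own transition.
    return -(-frames // frames_per_unit)


if __name__ == "__main__":
    fps = 30
    granularities = {
        "per frame": 1,
        "per 10-second chunk": fps * 10,
        "per hour-long segment": fps * 3600,
    }
    for label, frames_per_unit in granularities.items():
        print(f"{label}: {transitions_per_hour(fps, frames_per_unit):,} transitions/hour")
```

At 30 fps this gives 108,000 transitions per hour for per-frame orchestration, 360 for 10-second chunks, and 1 for an hour-long segment. The orchestrator is the same in all three cases; only the unit of work handed to it changes.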

Techniques
microservices overuse · per-item orchestration · cost explosion