Anti-Primer: AWS Lambda
Everything that can go wrong, will — and in this story, it does.
The Setup
A backend team is migrating a batch processing job from EC2 cron to Lambda. They assume Lambda is 'just a function' and skip the operational considerations. The migration must complete before the EC2 instance reservation expires.
The Timeline
Hour 0: Cold Start Blindness
An engineer deploys a Java Lambda with a 15-second cold start behind API Gateway. The deadline was looming, and this seemed like the fastest path forward. As a result, the first request after each idle period times out, and users see intermittent 504 errors.
Footgun #1: Cold Start Blindness. A Java Lambda with a 15-second cold start sits behind API Gateway, so the first request after each idle period times out and users see intermittent 504 errors.
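One way to avoid this on a latency-sensitive path is provisioned concurrency, which keeps a number of execution environments initialized ahead of time. The following is a minimal sketch, not the team's actual tooling: `enable_provisioned_concurrency` is a hypothetical helper, and `lambda_client` is assumed to be a boto3 Lambda client created elsewhere with `boto3.client("lambda")`.

```python
def enable_provisioned_concurrency(lambda_client, function_name, alias, instances):
    """Keep `instances` execution environments warm for one alias or version.

    `lambda_client` is assumed to be a boto3 Lambda client. It is passed in
    so this sketch carries no hard dependency on boto3 or AWS credentials.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # Provisioned concurrency is configured on a published version or an
    # alias; it cannot be applied to $LATEST.
    if alias == "$LATEST":
        raise ValueError("provisioned concurrency needs a version or alias")
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,
        ProvisionedConcurrentExecutions=instances,
    )
```

Provisioned environments are billed while they sit idle, so the count is a trade-off between latency and cost rather than a free fix.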
Nobody notices yet. The engineer moves on to the next task.
Hour 1: No Timeout Configuration
The function that processes large files is left on Lambda's default 3-second timeout. Under time pressure, the team chose speed over caution. As a result, invocations are killed mid-processing, and the partial results corrupt the downstream database.
Footgun #2: No Timeout Configuration. A function that processes large files keeps the default 3-second timeout, so it is killed mid-processing and its partial results corrupt the downstream database.
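The timeout itself can be raised (Lambda allows up to 900 seconds), but a function that handles large files can also watch its own time budget instead of being killed mid-write. A minimal sketch, assuming the work splits into chunks and that `commit` writes a batch atomically: `process_in_chunks`, `process_chunk`, `commit`, and `SAFETY_MARGIN_MS` are hypothetical names, while `get_remaining_time_in_millis()` is the method provided by the Lambda Python context object.

```python
SAFETY_MARGIN_MS = 5_000  # assumed margin; tune it to the slowest chunk


def process_in_chunks(chunks, context, process_chunk, commit):
    """Process chunks until done or until the time budget runs low.

    Returns the index of the first unprocessed chunk, which equals
    len(chunks) when everything finished. `commit` is called exactly once
    with the results of the completed prefix, so the caller always knows
    where to resume.
    """
    results = []
    next_index = 0
    for chunk in chunks:
        # Stop before starting a chunk that might not finish in time.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break
        results.append(process_chunk(chunk))
        next_index += 1
    commit(results)
    return next_index
```

The returned index could be handed to a follow-up invocation or a queue message so the remaining chunks are picked up later.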
The first mistake is still invisible, making the next shortcut feel justified.
Hour 2: Unbounded Concurrency
No reserved concurrency limit is set, and a traffic spike triggers 1,000 concurrent invocations. Nobody pushed back because the shortcut looked harmless in the moment. As a result, the downstream RDS connection pool is exhausted and the database rejects all new connections.
Footgun #3: Unbounded Concurrency. With no reserved concurrency limit, a traffic spike triggers 1,000 concurrent invocations, which exhausts the downstream RDS connection pool until the database rejects all new connections.
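The arithmetic behind a sensible cap is simple: concurrency times connections per invocation has to stay below what the database accepts. A minimal sketch under stated assumptions: `reserved_concurrency_for` and `cap_concurrency` are hypothetical helpers, the 20% headroom is an arbitrary illustrative default, and `lambda_client` is assumed to be a boto3 Lambda client.

```python
def reserved_concurrency_for(max_db_connections, connections_per_invocation=1,
                             headroom_percent=20):
    """Largest concurrency that keeps the database under its connection limit.

    `headroom_percent` reserves a share of connections for other clients
    (migrations, admin sessions). Integer math avoids float rounding.
    """
    if connections_per_invocation < 1:
        raise ValueError("connections_per_invocation must be at least 1")
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    usable = max_db_connections * (100 - headroom_percent) // 100
    return max(1, usable // connections_per_invocation)


def cap_concurrency(lambda_client, function_name, limit):
    """Apply a reserved concurrency limit to one function."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```

For a database that accepts 200 connections, `reserved_concurrency_for(200)` yields 160, far below the 1,000 invocations in the story. A connection pooler such as RDS Proxy addresses the same problem from the database side.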
Pressure is mounting. The team is behind schedule and cutting more corners.
Hour 3: Synchronous in Disguise
One Lambda calls another synchronously from inside its handler for 'simplicity'. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, timeouts cascade whenever the downstream function is slow, and compute costs roughly double because the caller is billed while it waits.
Footgun #4: Synchronous in Disguise. One Lambda calls another synchronously for 'simplicity', so timeouts cascade whenever the downstream function is slow, and the waiting caller roughly doubles the billed compute time.
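The asynchronous alternative hands the event to Lambda and returns immediately, so the caller stops paying for the callee's runtime. A minimal sketch, assuming the caller does not need the downstream result: `invoke_async` is a hypothetical helper, and `lambda_client` is assumed to be a boto3 Lambda client.

```python
import json


def invoke_async(lambda_client, function_name, payload):
    """Queue an event for `function_name` and return without waiting."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous; "RequestResponse" would block
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # For asynchronous invocations Lambda replies with HTTP 202 once the
    # event has been accepted into its queue.
    status = response["StatusCode"]
    if status != 202:
        raise RuntimeError(f"expected 202 Accepted, got {status}")
    return status
```

Because the caller never sees the result, failures need somewhere to go, for example a dead-letter queue or an on-failure destination. When the steps must be ordered or retried as a unit, SQS or Step Functions is usually a better fit than direct invocation.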
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem
Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Cold Start Blindness | First requests after idle periods time out; users see intermittent 504 errors | Primer: Provisioned concurrency for latency-sensitive paths; lighter runtimes |
| 2 | No Timeout Configuration | Function is killed mid-processing; partial results corrupt the downstream database | Primer: Set timeout to match actual processing time with margin |
| 3 | Unbounded Concurrency | Downstream RDS connection pool is exhausted; database rejects all connections | Primer: Reserved concurrency limits and connection pooling (RDS Proxy) |
| 4 | Synchronous in Disguise | Cascading timeouts when the downstream function is slow; the waiting caller roughly doubles the billed compute time | Primer: Use async invocation, SQS, or Step Functions for Lambda-to-Lambda |
Damage Report
- Downtime: 3-6 hours of degraded or unavailable cloud services
- Data loss: Likely; invocations killed by the timeout wrote partial results to the downstream database
- Customer impact: API errors, failed transactions, or service unavailability for end users
- Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
- Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required
What the Primer Teaches
- Footgun #1 (Cold Start Blindness): The primer recommends provisioned concurrency for latency-sensitive paths, or a lighter runtime.
- Footgun #2 (No Timeout Configuration): The primer recommends setting the timeout to match actual processing time, with margin.
- Footgun #3 (Unbounded Concurrency): The primer recommends reserved concurrency limits and connection pooling (RDS Proxy).
- Footgun #4 (Synchronous in Disguise): The primer recommends async invocation, SQS, or Step Functions for Lambda-to-Lambda calls.
Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice