Anti-Primer: AWS Lambda

Everything that can go wrong, will — and in this story, it does.

The Setup

A backend team is migrating a batch processing job from EC2 cron to Lambda. They assume Lambda is 'just a function' and skip the operational considerations. The migration must complete before the EC2 instance reservation expires.

The Timeline

Hour 0: Cold Start Blindness

The engineer deploys a Java Lambda with a 15-second cold start behind an API Gateway. The deadline is looming, and this seems like the fastest path forward. The result: the first requests after idle periods time out, and users see intermittent 504 errors.

Footgun #1: Cold Start Blindness. Deploying a Java Lambda with a 15-second cold start behind an API Gateway means the first requests after idle periods time out, and users see intermittent 504 errors.
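
A minimal sketch of the fix the primer points to, provisioned concurrency, assuming Python with boto3. The function name `orders-api` and the alias `live` are hypothetical, and the client is passed in as an argument so the helper can be exercised without AWS credentials.

```python
def enable_provisioned_concurrency(lambda_client, function_name, alias, instances):
    """Ask Lambda to keep `instances` execution environments initialized."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # Provisioned concurrency attaches to a published version or an alias,
    # not to $LATEST, so a qualifier is required.
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,
        ProvisionedConcurrentExecutions=instances,
    )


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   enable_provisioned_concurrency(boto3.client("lambda"), "orders-api", "live", 2)
```

Provisioned concurrency is billed for as long as it is configured, so it probably belongs only on latency-sensitive paths such as this API.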

Nobody notices yet. The engineer moves on to the next task.

Hour 1: No Timeout Configuration

The team leaves the default 3-second timeout on a function that processes large files. Under time pressure, they choose speed over caution. The result: the function is killed mid-processing, and partial results corrupt the downstream database.

Footgun #2: No Timeout Configuration. Leaving the default 3-second timeout on a function that processes large files means the function is killed mid-processing, and partial results corrupt the downstream database.
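
One way to pick a timeout, sketched under assumptions: take the slowest run observed in testing, multiply it by a safety margin, and refuse anything above Lambda's 900-second ceiling. The helper names and the 1.5x margin are illustrative, not values from the primer; the client is passed in so the code runs without AWS access.

```python
import math

LAMBDA_MAX_TIMEOUT_S = 900  # Lambda's hard ceiling: 15 minutes


def recommended_timeout(observed_durations_s, margin=1.5):
    """Return a timeout in whole seconds: slowest observed run times a margin."""
    if not observed_durations_s:
        raise ValueError("need at least one observed duration")
    needed = math.ceil(max(observed_durations_s) * margin)
    if needed > LAMBDA_MAX_TIMEOUT_S:
        # No timeout setting can help here; the job has to be split up.
        raise ValueError("job does not fit in a single invocation")
    return max(needed, 1)  # Lambda's minimum timeout is 1 second


def apply_timeout(lambda_client, function_name, timeout_s):
    """Set the function timeout through the (injected) boto3 Lambda client."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_s,
    )


# Usage (hypothetical measurements, in seconds):
#   timeout = recommended_timeout([12.0, 40.0, 31.5])
#   apply_timeout(boto3.client("lambda"), "file-processor", timeout)
```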

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Unbounded Concurrency

The function has no reserved concurrency limit, and a traffic spike triggers 1,000 concurrent invocations. Nobody pushes back because the shortcut looks harmless in the moment. The result: the downstream RDS connection pool is exhausted, and the database rejects all new connections.

Footgun #3: Unbounded Concurrency. With no reserved concurrency limit, a traffic spike triggers 1,000 concurrent invocations, the downstream RDS connection pool is exhausted, and the database rejects all new connections.
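
A rough sketch of sizing a reserved concurrency limit from the database's connection budget, assuming boto3. The helper names and the numbers in the usage comment are hypothetical; `put_function_concurrency` is the boto3 call that sets the limit, and the client is injected so the code runs without AWS access.

```python
def concurrency_cap(db_max_connections, reserved_for_other_clients,
                    connections_per_invocation=1):
    """Largest number of concurrent invocations the database can absorb."""
    available = db_max_connections - reserved_for_other_clients
    cap = available // connections_per_invocation
    if cap < 1:
        raise ValueError("no connection budget left for this function")
    return cap


def apply_reserved_concurrency(lambda_client, function_name, cap):
    """Limit the function to `cap` concurrent executions."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=cap,
    )


# Usage (hypothetical budget: 200 connections, 50 kept for other clients):
#   cap = concurrency_cap(200, 50)
#   apply_reserved_concurrency(boto3.client("lambda"), "file-processor", cap)
```

Invocations beyond the cap are throttled instead of reaching the database. A limit alone does not reuse connections, which is where the connection pooling the primer mentions (RDS Proxy) comes in.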

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Synchronous in Disguise

The function calls another Lambda synchronously from within its handler for 'simplicity'. The team has gotten away with similar shortcuts before, so nobody raises a flag. The result: cascading timeouts when the downstream function is slow, and roughly double the compute cost, because the caller is billed while it waits.

Footgun #4: Synchronous in Disguise. Calling another Lambda synchronously from within a Lambda for 'simplicity' causes cascading timeouts when the downstream function is slow, and roughly doubles the compute cost, because the caller is billed while it waits.
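
A hedged sketch of the asynchronous alternative, assuming boto3: `InvocationType="Event"` queues the request and returns immediately, so the caller is not billed for waiting. The function name `resize-worker` and the payload are hypothetical, and the client is injected so the code runs without AWS access.

```python
import json


def invoke_async(lambda_client, function_name, payload):
    """Hand work to another function without waiting for its result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous: Lambda queues the event
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # An accepted asynchronous invocation is acknowledged with HTTP 202.
    if response["StatusCode"] != 202:
        raise RuntimeError(f"unexpected status {response['StatusCode']}")
    return response


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   invoke_async(boto3.client("lambda"), "resize-worker", {"key": "uploads/a.png"})
```

The caller no longer receives the downstream result, so any outcome has to flow back through another channel, such as an SQS queue or a Step Functions workflow.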

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

#	Mistake	Consequence	Could Have Been Prevented By
1	Cold Start Blindness	First requests after idle periods time out; users see intermittent 504 errors	Primer: Provisioned concurrency for latency-sensitive paths; lighter runtimes
2	No Timeout Configuration	Function is killed mid-processing; partial results corrupt the downstream database	Primer: Set timeout to match actual processing time with margin
3	Unbounded Concurrency	Downstream RDS connection pool is exhausted; database rejects all connections	Primer: Reserved concurrency limits and connection pooling (RDS Proxy)
4	Synchronous in Disguise	Cascading timeouts when the downstream function is slow; costs 2x the compute time	Primer: Use async invocation, SQS, or Step Functions for Lambda-to-Lambda

Damage Report

Downtime: 3-6 hours of degraded or unavailable cloud services
Data loss: Possible, since partial results from the killed function corrupted records in the downstream database
Customer impact: API errors, failed transactions, or service unavailability for end users
Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required

What the Primer Teaches

Footgun #1: Had the engineer read the primer's section on cold starts, they would have learned to use provisioned concurrency for latency-sensitive paths and to prefer lighter runtimes.
Footgun #2: Had the engineer read the primer's section on timeout configuration, they would have learned to set the timeout to match actual processing time, with margin.
Footgun #3: Had the engineer read the primer's section on concurrency, they would have learned to set reserved concurrency limits and to pool connections (RDS Proxy).
Footgun #4: Had the engineer read the primer's section on Lambda-to-Lambda calls, they would have learned to use async invocation, SQS, or Step Functions.

Cross-References

Primer — The right way
Footguns — The mistakes catalogued
Street Ops — How to do it in practice