Anti-Primer: AWS Lambda

Everything that can go wrong, will — and in this story, it does.

The Setup

A backend team is migrating a batch processing job from EC2 cron to Lambda. They assume Lambda is 'just a function' and skip the operational considerations. The migration must complete before the EC2 instance reservation expires.

The Timeline

Hour 0: Cold Start Blindness

The engineer deploys a Java Lambda with a 15-second cold start behind API Gateway. The deadline was looming, and this seemed like the fastest path forward. As a result, the first requests after idle periods time out, and users see intermittent 504 errors.

Footgun #1: Cold Start Blindness. Deploying a Java Lambda with a 15-second cold start behind API Gateway causes the first requests after idle periods to time out; users see intermittent 504 errors.
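
One way to size provisioned concurrency for a latency-sensitive path is the usual rule of thumb that concurrency is roughly request rate times average duration. The sketch below is a minimal illustration under that assumption. The function names, the 50% headroom default, and the example function and alias names are hypothetical. The returned dictionary uses the parameter names of the boto3 Lambda client call `put_provisioned_concurrency_config`, which must target a published version or alias rather than `$LATEST`.

```python
import math


def provisioned_instances_needed(requests_per_second: float,
                                 avg_duration_s: float,
                                 headroom: float = 1.5) -> int:
    """Estimate how many warm instances to keep ready.

    Uses concurrency ~= request rate x average duration, scaled by an
    assumed headroom factor (1.5 is an arbitrary illustrative default).
    """
    if requests_per_second < 0 or avg_duration_s < 0 or headroom <= 0:
        raise ValueError("rate and duration must be >= 0, headroom > 0")
    estimate = requests_per_second * avg_duration_s * headroom
    # Keep at least one warm instance so an idle period does not
    # expose the first caller to a cold start.
    return max(1, math.ceil(estimate))


def provisioned_concurrency_request(function_name: str,
                                    alias: str,
                                    requests_per_second: float,
                                    avg_duration_s: float) -> dict:
    """Build keyword arguments for put_provisioned_concurrency_config."""
    return {
        "FunctionName": function_name,
        "Qualifier": alias,  # a version or alias, not $LATEST
        "ProvisionedConcurrentExecutions": provisioned_instances_needed(
            requests_per_second, avg_duration_s
        ),
    }
```

With a boto3 Lambda client in hand, the dictionary could be passed as `client.put_provisioned_concurrency_config(**request)`. Provisioned instances are billed while idle, so the headroom factor is a cost trade-off rather than a free safety net.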

Nobody notices yet. The engineer moves on to the next task.

Hour 1: No Timeout Configuration

The engineer leaves the default 3-second timeout on a function that processes large files. Under time pressure, the team chose speed over caution. As a result, the function is killed mid-processing, and partial results corrupt the downstream database.

Footgun #2: No Timeout Configuration. Leaving the default 3-second timeout on a function that processes large files means the function is killed mid-processing; partial results corrupt the downstream database.
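
Raising the timeout is the first fix, but a handler can also guard against being killed mid-write by checking how much time it has left. The sketch below is a minimal illustration, assuming the work can be split into independent records. `process_records`, `handle_record`, and the 2-second safety margin are hypothetical. `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object.

```python
SAFETY_MARGIN_MS = 2_000  # assumed buffer for checkpointing before the timeout


def process_records(records, context, handle_record, start_index=0):
    """Process records until finished or until the time budget is nearly spent.

    Returns a checkpoint so a follow-up invocation can resume from
    next_index instead of leaving a half-written batch behind.
    """
    index = start_index
    while index < len(records):
        # Stop cleanly while there is still time to return a checkpoint.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return {"done": False, "next_index": index}
        handle_record(records[index])
        index += 1
    return {"done": True, "next_index": index}
```

This only helps if each `handle_record` call is short compared with the margin and writes its result atomically. A single slow record can still run into the timeout.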

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Unbounded Concurrency

The function has no reserved concurrency limit, and a traffic spike triggers 1,000 concurrent invocations. Nobody pushed back because the shortcut looked harmless in the moment. As a result, the downstream RDS connection pool is exhausted, and the database rejects all connections.

Footgun #3: Unbounded Concurrency. With no reserved concurrency limit, a traffic spike triggers 1,000 concurrent invocations; the downstream RDS connection pool is exhausted, and the database rejects all connections.
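
A reserved concurrency cap can be derived from what the database can actually accept. The sketch below is a minimal illustration, assuming each invocation holds a fixed number of connections and no proxy sits in between. The function names and the example numbers are hypothetical. The returned dictionary uses the parameter names of the boto3 Lambda client call `put_function_concurrency`.

```python
def safe_reserved_concurrency(db_max_connections: int,
                              connections_per_invocation: int = 1,
                              reserved_for_other_clients: int = 0) -> int:
    """Largest concurrency that cannot exhaust the database connections."""
    if connections_per_invocation < 1:
        raise ValueError("connections_per_invocation must be >= 1")
    usable = db_max_connections - reserved_for_other_clients
    # Integer division: a partially served invocation is not allowed.
    return max(0, usable // connections_per_invocation)


def reserved_concurrency_request(function_name: str,
                                 db_max_connections: int,
                                 connections_per_invocation: int = 1,
                                 reserved_for_other_clients: int = 0) -> dict:
    """Build keyword arguments for put_function_concurrency."""
    return {
        "FunctionName": function_name,
        "ReservedConcurrentExecutions": safe_reserved_concurrency(
            db_max_connections,
            connections_per_invocation,
            reserved_for_other_clients,
        ),
    }
```

For example, a database that accepts 100 connections with 20 held back for other clients would cap the function at 80 concurrent invocations, far below the 1,000 in the story. A cap of 0 throttles the function entirely, so a result that low signals a sizing problem rather than a value to apply.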

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Synchronous in Disguise

The engineer calls another Lambda synchronously from within a Lambda for 'simplicity'. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, timeouts cascade when the downstream function is slow, and because the caller is billed while it waits, the work costs 2x the compute time.

Footgun #4: Synchronous in Disguise. Calling another Lambda synchronously from within a Lambda for 'simplicity' leads to cascading timeouts when the downstream function is slow, and it costs 2x the compute time.
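
When the caller does not need the downstream result, an asynchronous invocation lets it return immediately instead of waiting and paying for idle time. The sketch below is a minimal illustration, assuming `lambda_client` is a boto3 Lambda client (or anything with a compatible `invoke` method). `enqueue_downstream` and the function name passed to it are hypothetical. `InvocationType="Event"` and the 202 status code are the documented behavior for asynchronous invokes.

```python
import json


def enqueue_downstream(lambda_client, function_name: str, payload: dict) -> int:
    """Hand work to another function without waiting for it to finish."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous: Lambda queues the event
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # An accepted asynchronous invoke returns 202; anything else means
    # the event was not queued.
    if response["StatusCode"] != 202:
        raise RuntimeError(
            f"async invoke of {function_name} returned {response['StatusCode']}"
        )
    return response["StatusCode"]
```

An asynchronous invoke only confirms that the event was accepted, not that the downstream function succeeded. Workflows that need results, retries with visibility, or ordering are usually a better fit for SQS or Step Functions.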

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Cold Start Blindness | First requests after idle periods time out; users see intermittent 504 errors | Primer: Provisioned concurrency for latency-sensitive paths; lighter runtimes |
| 2 | No Timeout Configuration | Function is killed mid-processing; partial results corrupt the downstream database | Primer: Set timeout to match actual processing time with margin |
| 3 | Unbounded Concurrency | Downstream RDS connection pool is exhausted; database rejects all connections | Primer: Reserved concurrency limits and connection pooling (RDS Proxy) |
| 4 | Synchronous in Disguise | Cascading timeouts when the downstream function is slow; costs 2x the compute time | Primer: Use async invocation, SQS, or Step Functions for Lambda-to-Lambda |

Damage Report

  • Downtime: 3-6 hours of degraded or unavailable cloud services
  • Data loss: Possible if storage or database resources were affected
  • Customer impact: API errors, failed transactions, or service unavailability for end users
  • Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
  • Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required

What the Primer Teaches

  • Footgun #1: Had the engineer read the primer's section on cold start blindness, they would have learned to use provisioned concurrency for latency-sensitive paths and to prefer lighter runtimes.
  • Footgun #2: Had the engineer read the primer's section on timeout configuration, they would have learned to set the timeout to match actual processing time, with margin.
  • Footgun #3: Had the engineer read the primer's section on unbounded concurrency, they would have learned to set reserved concurrency limits and to pool connections (RDS Proxy).
  • Footgun #4: Had the engineer read the primer's section on synchronous Lambda-to-Lambda calls, they would have learned to use async invocation, SQS, or Step Functions instead.

Cross-References