
AWS Lambda: The Function That Runs Itself

  • lesson
  • aws-lambda
  • serverless
  • cold-starts
  • vpc-networking
  • concurrency
  • event-driven-architecture
  • observability
  • cost-optimization
  • l2

Topics: AWS Lambda, serverless, cold starts, VPC networking, concurrency, event-driven architecture, observability, cost optimization
Level: L2 (Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)


The Mission

It's 9:47am on a Monday. PagerDuty fires. Your order processing Lambda — the one that reads from an SQS queue and writes to DynamoDB — is timing out intermittently. Not every invocation. Maybe 5% of them. The ones that fail get retried, some succeed on retry, some don't. Your DLQ is filling up. Customers are complaining about orders stuck in "processing."

Your teammate says "cold starts." Your tech lead says "it's the VPC attachment, I've seen this before." The CloudWatch dashboard shows... a lot of lines. You need to figure out which one of them is right, or whether they're both wrong.

By the end of this lesson you'll understand:

  • What actually happens inside a Lambda invocation, layer by layer
  • Why cold starts exist and exactly how long each phase takes
  • When VPC attachment is the problem (and when it isn't anymore)
  • How to read CloudWatch Insights like a detective
  • The concurrency model that makes Lambda magical and dangerous
  • How to stop a recursive Lambda from bankrupting your team
  • Why "serverless" still means you think about servers

We'll build up from the execution model, then use that knowledge to solve the mission.


Part 1: What Is a Lambda, Actually?

Forget "serverless function" for a moment. Here's what Lambda really is: a tiny, frozen virtual machine that AWS thaws when a request arrives, runs your code, and freezes again.

You deploy a zip file (or container image)
AWS stores it in an internal S3 bucket
An event arrives (HTTP request, SQS message, S3 upload, cron tick)
Lambda service checks: is there a warm execution environment?
    ├── YES → route the event to it (warm start, fast)
    └── NO  → create a new one (cold start, slow)
Your handler function runs
Response goes back to the caller
Execution environment stays alive for ~5-15 minutes (hoping for another request)
No request arrives → environment is destroyed
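You can observe the freeze/thaw cycle from inside a function: module scope runs once per execution environment, so a module-level counter distinguishes cold from warm. A minimal sketch (a hypothetical function, not the mission's order processor):

```python
import time

# Module scope: runs ONCE per execution environment (cold start only)
_env_started_at = time.time()
_invocation_count = 0

def handler(event, context):
    # Handler scope: runs on EVERY invocation, warm or cold
    global _invocation_count
    _invocation_count += 1
    return {
        "cold_start": _invocation_count == 1,  # first call in this environment
        "env_age_seconds": round(time.time() - _env_started_at, 2),
        "invocations_in_this_env": _invocation_count,
    }
```

Invoke it twice in quick succession and the second response reports a warm environment; wait 15+ minutes and you'll likely see `cold_start` flip back to true in a fresh environment.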

Name Origin: The name "Lambda" comes from lambda calculus, a formal system invented by mathematician Alonzo Church in the 1930s. Lambda calculus uses the Greek letter lambda (λ) to denote anonymous functions — functions without names. That's exactly what AWS Lambda is: anonymous functions that exist only when called. Church was Alan Turing's doctoral advisor at Princeton, and lambda calculus was proven equivalent to Turing machines — two completely different formalizations that describe the same thing. Python borrowed the keyword lambda for the same reason: lambda x: x + 1 is an anonymous function.

Trivia: Lambda launched at AWS re:Invent in November 2014, supporting only Node.js with a 60-second maximum timeout. It was the first major FaaS (Function-as-a-Service) offering from any cloud provider. Google Cloud Functions and Azure Functions followed in 2016. By 2022, AWS reported Lambda was processing over 100 trillion invocations per month across all customers.

The execution environment

Each Lambda invocation runs inside a Firecracker microVM — the same technology that powers Fargate. AWS open-sourced Firecracker in 2018. It can boot a microVM in ~125 milliseconds using about 5 MB of memory overhead. That's why Lambda cold starts are measured in hundreds of milliseconds, not tens of seconds.

Your code gets:

  • A Linux environment (Amazon Linux 2023 for most runtimes)
  • The runtime you chose (Python 3.12, Node.js 20, Java 21, etc.)
  • 512 MB of /tmp storage by default (expandable to 10 GB)
  • Whatever memory you configured (128 MB to 10,240 MB)
  • CPU proportional to memory

Remember: The magic number is 1,769 MB = 1 full vCPU. Below that, your function gets a fraction of a CPU. At 128 MB, you get roughly 1/14th of a vCPU. A function doing JSON parsing at 128 MB might take 3 seconds; at 512 MB it might take 800ms. You often save money by spending more on memory because the shorter duration offsets the higher per-millisecond price.
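The tradeoff can be sanity-checked with the billing formula (GB-seconds = memory in GB x billed seconds). The durations below are this document's illustrative numbers, and the price is the published us-east-1 on-demand x86 rate — check your region:

```python
PRICE_PER_GB_SECOND = 0.0000166667  # us-east-1 on-demand x86 rate (assumption)

def compute_cost(memory_mb: int, duration_ms: float) -> float:
    """Per-invocation compute cost in dollars."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND

slow = compute_cost(128, 3000)  # 1/14th of a vCPU, 3 s   -> 0.375 GB-s
fast = compute_cost(512, 800)   # 4x memory (and CPU), 800 ms -> 0.4 GB-s
# Nearly identical cost, roughly 4x lower latency at 512 MB
```

The lesson: memory is also the CPU dial, so paying for more of it often buys latency at near-zero marginal cost.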


Part 2: Anatomy of a Cold Start

This is where we solve part of the mystery. Let's trace what happens when Lambda creates a new execution environment for your function.

Cold start phases:
┌───────────────┬───────────────┬───────────────┬────────────────┬────────────┐
│ 1. Download   │ 2. Extract    │ 3. Runtime    │ 4. Your init   │ 5. Handler │
│    code       │    & mount    │    bootstrap  │    code runs   │    runs    │
│   ~50-200ms   │   ~10-50ms    │   ~30-100ms   │   varies       │  your code │
└───────────────┴───────────────┴───────────────┴────────────────┴────────────┘
 ← AWS controls these (you can't speed them up) →←────── you control ──────→

Phase 1: Download. AWS pulls your deployment package from internal storage. A 5 MB Python zip takes ~50ms. A 250 MB package with numpy and pandas takes ~200ms. A 10 GB container image takes longer but benefits from layer caching.

Phase 2: Extract & mount. The zip is decompressed and mounted into the execution environment's filesystem.

Phase 3: Runtime bootstrap. The Python interpreter starts, the Node.js V8 engine initializes, or the JVM boots. This is why Java Lambda cold starts are notoriously slow — the JVM needs time. AWS SnapStart (launched 2022) addresses this for Java by snapshotting a pre-initialized JVM.

Phase 4: Your init code. Everything outside your handler function runs once per cold start. This is where you should put SDK clients, database connections, and config loading:

# This runs ONCE per cold start (phase 4)
import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ["TABLE_NAME"])
sqs = boto3.client("sqs")

# This runs on EVERY invocation (phase 5)
def handler(event, context):
    for record in event["Records"]:
        body = json.loads(record["body"])
        table.put_item(Item=body)
    return {"statusCode": 200}

Phase 5: Handler. Your actual request processing. On a warm start, only this phase runs.

Measuring cold starts in the wild

The REPORT line at the end of every Lambda invocation tells you everything. When a cold start happens, it includes Init Duration:

REPORT RequestId: abc-123
  Duration: 45.67 ms
  Billed Duration: 46 ms
  Memory Size: 256 MB
  Max Memory Used: 89 MB
  Init Duration: 312.45 ms    ← THIS means cold start happened

No Init Duration? Warm start.
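If you're eyeballing raw log streams rather than running Insights queries, a few lines of Python can classify invocations. A sketch that assumes each REPORT record arrives as a single line, as it does in real CloudWatch log streams (the helper name is made up):

```python
import re

def parse_report(line: str) -> dict:
    """Extract the duration fields from a Lambda REPORT log line."""
    fields = {}
    for key, name in [("Duration", "duration_ms"),
                      ("Billed Duration", "billed_ms"),
                      ("Init Duration", "init_ms")]:
        m = re.search(rf"{key}: ([\d.]+) ms", line)
        if m:
            fields[name] = float(m.group(1))
    # Init Duration present = cold start; absent = warm start
    fields["cold_start"] = "init_ms" in fields
    return fields
```

Pipe `aws logs tail` through it and you get a quick cold/warm tally without waiting on a query.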

Let's query CloudWatch Insights to find out how bad the cold starts are for our mission function:

# How many cold starts vs warm starts in the last hour?
aws logs start-query \
  --log-group-name "/aws/lambda/order-processor" \
  --start-time "$(date -u -d '1 hour ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @type = "REPORT"
    | stats count() as invocations,
            count(@initDuration) as coldStarts,
            avg(@initDuration) as avgColdStartMs,
            max(@initDuration) as maxColdStartMs,
            avg(@duration) as avgDurationMs,
            pct(@duration, 99) as p99DurationMs
  '
# Wait a few seconds for results
QUERY_ID="<from above command output>"
aws logs get-query-results --query-id "$QUERY_ID"

Typical output for a healthy function:

invocations: 12,847
coldStarts: 43
avgColdStartMs: 387.2
maxColdStartMs: 1,247.8
avgDurationMs: 52.3
p99DurationMs: 198.4

43 cold starts out of 12,847 invocations is 0.3%. That's normal. If you're seeing 20%+ cold starts, either traffic is very bursty or something is forcing environment recycling.


Flashcard Check #1

| Question | Answer |
|---|---|
| What are the 5 phases of a Lambda cold start? | Download code, extract/mount, runtime bootstrap, your init code, handler execution |
| Where should you initialize SDK clients and DB connections? | Outside the handler (module scope) — runs once per cold start, reused on warm starts |
| What's the magic memory number for 1 full vCPU? | 1,769 MB |
| How do you identify a cold start in CloudWatch logs? | The REPORT line includes Init Duration for cold starts |

Part 3: The VPC Problem (And Why Your Tech Lead Might Be Living in 2018)

Here's the thing your tech lead remembers: before 2019, attaching a Lambda to a VPC was brutal. Every cold start had to create an Elastic Network Interface (ENI) in your VPC subnet. That added 10-15 seconds to cold starts. It was awful. People wrote blog posts about it. Conference talks were given. "Don't put Lambda in a VPC" became gospel.

Then AWS shipped Hyperplane in September 2019, and the world changed.

Pre-2019 (the bad old days)

Cold start + VPC (pre-Hyperplane):
┌──────────┬─────────┬────────────┬────────────┬─────────┬─────────┬───────────┐
│ Download │ Extract │ Create ENI │ Attach ENI │ Runtime │ Init    │ Handler   │
│  ~100ms  │  ~30ms  │  ~8-10s    │  ~2-5s     │  ~50ms  │ varies  │ your code │
└──────────┴─────────┴────────────┴────────────┴─────────┴─────────┴───────────┘
                     ←─ this was the killer ─→
Total cold start: 10-15 seconds. For a user-facing API. Unacceptable.

Post-2019 (Hyperplane)

AWS pre-creates shared ENIs when you deploy or update the function. Cold starts no longer include ENI creation:

Cold start + VPC (post-Hyperplane):
┌──────────┬─────────┬─────────┬───────────┬───────────┐
│ Download │ Extract │ Runtime │ Init code │ Handler   │
│  ~100ms  │  ~30ms  │  ~50ms  │ varies    │ your code │
└──────────┴─────────┴─────────┴───────────┴───────────┘
The ENI is already there. Cold start is nearly identical to non-VPC.

But VPC Lambda still has quirks:

  1. No internet without a NAT Gateway. Lambda in your VPC doesn't get internet access automatically. If your function calls an external API or even an AWS service endpoint, it needs a NAT Gateway ($32/month + data processing) or VPC endpoints.

  2. Subnet IP exhaustion. Each Lambda execution environment uses an IP from your subnet. If you have a /24 subnet (254 usable IPs) and 300 concurrent Lambda invocations, you'll run out.

  3. DNS resolution. VPC Lambda uses the VPC's DNS resolver. If that resolver is overloaded or misconfigured, Lambda invocations slow down.
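It's worth doing the subnet math before the incident rather than during it. A sketch with the standard library — the /24 and the 300 concurrent invocations are the figures from quirk #2, and the helper is hypothetical:

```python
import ipaddress

def subnet_headroom(cidr: str, peak_concurrency: int) -> dict:
    """Compare a subnet's usable host addresses against Lambda concurrency."""
    net = ipaddress.ip_network(cidr)
    # Network + broadcast addresses are unusable; AWS actually reserves
    # a few more per subnet, so treat this as an optimistic upper bound.
    usable = net.num_addresses - 2
    return {
        "usable_ips": usable,
        "peak_concurrency": peak_concurrency,
        "exhausted": peak_concurrency > usable,
    }
```

Run it against each subnet in the function's VpcConfig: a /24 against 300 concurrent invocations comes back exhausted, which matches the intermittent-timeout symptom.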

Let's check if our mission function is VPC-attached:

# Is the function in a VPC?
aws lambda get-function-configuration \
  --function-name order-processor \
  --query 'VpcConfig.{VpcId:VpcId,Subnets:SubnetIds,SecurityGroups:SecurityGroupIds}'

If that returns subnets and security groups, it's VPC-attached. Check if the subnets have enough IPs:

# Check available IPs in the Lambda subnets
for subnet in subnet-0a1b2c3d subnet-0e4f5a6b; do
  aws ec2 describe-subnets --subnet-ids "$subnet" \
    --query 'Subnets[0].{SubnetId:SubnetId,AZ:AvailabilityZone,AvailableIPs:AvailableIpAddressCount,CIDR:CidrBlock}'
done

Gotcha: If you see AvailableIPs: 3 on a subnet that feeds a Lambda with 200 concurrent invocations — that's your problem. The fix is bigger subnets or more subnets across AZs. This isn't a cold start issue; it's an infrastructure sizing issue that looks like intermittent timeouts.


Part 4: Back to the Mission — Finding the Real Culprit

So is it cold starts or VPC? Let's be systematic. We have CloudWatch Insights, X-Ray, and the metrics dashboard. Here's the diagnostic ladder:

Step 1: Are timeouts correlated with cold starts?

aws logs start-query \
  --log-group-name "/aws/lambda/order-processor" \
  --start-time "$(date -u -d '1 hour ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @type = "REPORT"
    | fields ispresent(@initDuration) as isColdStart,
             @duration > 25000 as isTimeout
    | stats count() as total,
            sum(isColdStart and isTimeout) as coldStartTimeouts,
            sum(not isColdStart and isTimeout) as warmStartTimeouts,
            sum(isColdStart and not isTimeout) as coldStartOK
  '

If warmStartTimeouts is high, cold starts are not the problem. Something downstream is slow.

Step 2: What's the function waiting on?

Enable X-Ray tracing if it's not already on:

aws lambda update-function-configuration \
  --function-name order-processor \
  --tracing-config Mode=Active

X-Ray breaks down time spent in each AWS SDK call — DynamoDB reads, SQS deletes, external HTTP calls. If the function spends 24 seconds waiting on a DynamoDB PutItem, that's your answer.

Step 3: Is there throttling?

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name Throttles \
  --dimensions Name=FunctionName,Value=order-processor \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 60 --statistics Sum

Throttling means you've hit the concurrency limit. SQS messages go back to the queue and retry, but if the throttling is sustained, messages age out and land in the DLQ.

Step 4: Check concurrency.

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name ConcurrentExecutions \
  --dimensions Name=FunctionName,Value=order-processor \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 60 --statistics Maximum

# Check what the limit is
aws lambda get-function-concurrency --function-name order-processor

Mental Model: Think of Lambda concurrency like a restaurant. Reserved concurrency is having your own reserved tables — guaranteed, but you can't use more than you reserved. Unreserved concurrency is the shared dining area — first come, first served, and if the restaurant fills up, you wait outside. Provisioned concurrency is having a personal chef standing by — zero wait time, but you pay whether you eat or not.

The twist

In our mission scenario, the real culprit is often none of the obvious suspects. Here's what we find:

  • Cold start rate: 0.4% (normal)
  • VPC: yes, but Hyperplane is active, subnets have plenty of IPs
  • No throttling
  • Concurrent executions: 45 (well under the 200 reserved limit)

But X-Ray shows that 5% of DynamoDB PutItem calls take 20+ seconds. The DynamoDB table is in on-demand mode but has a partition that's hot — all orders go to the same partition key pattern. The timeout isn't Lambda's fault at all. It's downstream.

This is the lesson within the lesson: Lambda timeout investigations almost always lead somewhere else. Lambda is the messenger, not the murderer.


Flashcard Check #2

| Question | Answer |
|---|---|
| What changed about VPC Lambda cold starts in 2019? | AWS Hyperplane pre-creates ENIs at deploy time, eliminating the 10-15s ENI creation penalty from cold starts |
| What does VPC-attached Lambda lose by default? | Internet access — you need a NAT Gateway or VPC endpoints |
| If warm start invocations are also timing out, what does that tell you? | The problem is downstream (database, external API, throttling), not cold starts |
| What's the difference between reserved and provisioned concurrency? | Reserved guarantees AND caps concurrent slots. Provisioned pre-warms environments for zero cold starts (and you pay for it whether used or not) |

Part 5: Event Sources — How Lambda Gets Triggered

Lambda doesn't run by itself. Something sends it an event. The trigger type determines retry behavior, error handling, and how you think about failures.

The three invocation models

| Model | Who waits? | Retries | Examples |
|---|---|---|---|
| Synchronous | Caller blocks until response | Caller handles retries | API Gateway, ALB, SDK invoke() |
| Asynchronous | Lambda queues it, caller gets 202 | Lambda retries twice, then DLQ | S3, SNS, EventBridge, CloudFormation |
| Event source mapping | Lambda polls the source | Depends on source type | SQS, Kinesis, DynamoDB Streams, Kafka |

This matters because the failure mode is completely different:

Synchronous: API Gateway sends a request. Lambda errors. API Gateway returns 500 to the user. The user sees an error. You notice immediately.

Asynchronous: S3 fires an event. Lambda errors. Lambda retries twice. Fails again. Event goes to DLQ (if configured). If no DLQ? The event is silently dropped. Nobody notices until a customer asks why their uploaded file wasn't processed.
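The retry-then-drop behavior is easy to model. A toy sketch of the async delivery logic (initial attempt plus two retries, then DLQ or silence) — illustrative only, not the real service:

```python
def async_invoke(process, event, dlq=None, max_retries=2):
    """Mimic Lambda async delivery: try, retry up to max_retries,
    then dead-letter the event if a DLQ exists — else drop it silently."""
    for attempt in range(1 + max_retries):
        try:
            process(event)
            return {"status": "ok", "attempts": attempt + 1}
        except Exception:
            continue  # real Lambda waits with backoff between retries
    if dlq is not None:
        dlq.append(event)  # preserved for later inspection and replay
        return {"status": "dead-lettered", "attempts": 1 + max_retries}
    return {"status": "dropped", "attempts": 1 + max_retries}  # gone forever
```

With no `dlq` argument, the failed event simply vanishes — no error surfaced to anyone, which is why configuring a DLQ or failure destination is non-negotiable for async triggers.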

Event source mapping (SQS): Lambda polls 10 messages. Processes 7 successfully. Message 8 throws an exception. Without partial batch failure reporting, all 10 messages go back to the queue. Messages 1-7 get reprocessed. If your processing isn't idempotent, you just double-charged seven customers.

# Partial batch failure handling — essential for SQS triggers
import json
import logging

logger = logging.getLogger()

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            order = json.loads(record["body"])
            process_order(order)
        except Exception as e:
            logger.error(f"Failed to process {record['messageId']}: {e}")
            failures.append({"itemIdentifier": record["messageId"]})

    # Only failed messages go back to the queue
    return {"batchItemFailures": failures}

Gotcha: Partial batch failure reporting requires enabling ReportBatchItemFailures on the event source mapping. Without it, returning batchItemFailures does nothing — Lambda still treats any exception as a full batch failure. This is a common gotcha because the function code looks correct but the configuration is missing.

# Enable partial batch failure reporting on an SQS trigger
aws lambda update-event-source-mapping \
  --uuid "your-mapping-uuid" \
  --function-response-types "ReportBatchItemFailures"

Part 6: The War Story — Recursive Lambda and the $45,000 Weekend

War Story: A team set up an image processing pipeline: when a user uploads to the uploads/ prefix in S3, a Lambda resizes the image and writes the result to the same bucket under thumbnails/. Works great. Then someone changes the S3 event trigger from "prefix: uploads/" to "all object creates" — because they want to process files from another prefix too. Now: upload triggers Lambda, Lambda writes thumbnail to the same bucket, that write triggers Lambda, Lambda writes another thumbnail, that triggers Lambda... exponential recursion. Each invocation creates more invocations. In 20 minutes, the function hit 1,000 concurrent executions (the account limit), racking up millions of invocations. The weekend bill: $45,000 in Lambda compute plus S3 PUT costs. AWS now has automatic recursive loop detection (introduced 2023) that stops some patterns after 16 recursive invocations, but it doesn't catch all patterns and you should never rely on it.

How to prevent this:

  1. Never write to the same bucket/table/queue that triggers the function. Use separate source and destination resources.
  2. Set reserved concurrency as a circuit breaker. A function capped at 50 concurrent executions can't run away as fast.
  3. Set billing alarms. An alarm at $100/day would have caught this in 30 minutes, not 60 hours.

# Set reserved concurrency as a safety cap
aws lambda put-function-concurrency \
  --function-name image-processor \
  --reserved-concurrent-executions 50

# Create a billing alarm (via CloudWatch)
aws cloudwatch put-metric-alarm \
  --alarm-name "lambda-cost-spike" \
  --namespace AWS/Billing \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum --period 21600 --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts

Part 7: Lambda Layers and Powertools

Layers: shared libraries without bloat

Instead of bundling requests, boto3, or your shared utils into every function's zip, put them in a layer. Each function can attach up to 5 layers.

# Build a layer with your dependencies
mkdir -p /tmp/layer/python/lib/python3.12/site-packages
pip install requests aws-lambda-powertools \
  -t /tmp/layer/python/lib/python3.12/site-packages/
cd /tmp/layer && zip -r layer.zip python/

aws lambda publish-layer-version \
  --layer-name shared-utils \
  --zip-file fileb://layer.zip \
  --compatible-runtimes python3.12

# Attach the layer to a function
aws lambda update-function-configuration \
  --function-name order-processor \
  --layers arn:aws:lambda:us-east-1:123456789012:layer:shared-utils:1

Under the Hood: Layers are extracted into /opt/ in the execution environment. The Python runtime adds /opt/python/lib/python3.12/site-packages to sys.path automatically. Node.js looks in /opt/nodejs/node_modules. If you're wondering why the directory structure matters so much — it's because Lambda doesn't do anything clever. It just unpacks the zip to /opt/ and the runtime's default search paths do the rest.
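You can verify this from inside any Python function by printing sys.path during init. The runtime's behavior amounts to a couple of path appends — sketched here outside Lambda (the /opt paths are the real Python 3.12 ones; the helper function is hypothetical):

```python
# Paths the Python 3.12 runtime adds for layer contents extracted under /opt/
LAYER_PATHS = ["/opt/python", "/opt/python/lib/python3.12/site-packages"]

def with_layer_paths(search_path: list) -> list:
    """Return a module search path with the layer directories appended,
    mimicking what the Lambda runtime does before your init code runs."""
    return search_path + [p for p in LAYER_PATHS if p not in search_path]
```

If an import from a layer fails, the first thing to check is whether your zip put the packages under python/ (so they land at /opt/python/...) — a layer zipped with the wrong top-level directory extracts somewhere sys.path never looks.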

Lambda Powertools: the stdlib for production Lambda

Lambda Powertools gives you structured logging, distributed tracing, custom metrics, idempotency, and event parsing. It's the difference between print() debugging and actual observability.

from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.utilities.typing import LambdaContext
from aws_lambda_powertools.utilities.batch import (
    BatchProcessor, EventType, process_partial_response
)
import json

logger = Logger()
tracer = Tracer()
metrics = Metrics()
processor = BatchProcessor(event_type=EventType.SQS)

@tracer.capture_method
def process_order(record):
    order = json.loads(record["body"])
    logger.info("Processing order", order_id=order["id"], amount=order["total"])
    metrics.add_metric(name="OrderProcessed", unit="Count", value=1)
    # ... actual business logic ...

@logger.inject_lambda_context
@tracer.capture_lambda_handler
@metrics.log_metrics
def handler(event: dict, context: LambdaContext):
    return process_partial_response(
        event=event,
        record_handler=process_order,
        processor=processor,
        context=context,
    )

That handler gives you:

  • Structured JSON logs with correlation IDs (no more parsing unstructured strings)
  • X-Ray traces showing time in each method
  • Custom CloudWatch metrics (OrderProcessed count)
  • Automatic partial batch failure handling


Part 8: Lambda@Edge vs CloudFront Functions

Both run code at CDN edge locations. The choice is about complexity vs speed.

| Dimension | Lambda@Edge | CloudFront Functions |
|---|---|---|
| Runtime | Node.js, Python | JavaScript only |
| Max execution | 5s (viewer), 30s (origin) | Sub-millisecond (<2ms hard limit) |
| Memory | Up to 10 GB | 2 MB |
| Network access | Yes (can call APIs, databases) | No |
| Use cases | Auth with DB lookup, A/B testing, image resizing | Header manipulation, URL rewrite, token validation |
| Price | Per request + duration | Per request only (5-6x cheaper) |
| Deploy scope | us-east-1 only, replicated globally | Any region, replicated globally |

Rule of thumb: If you can do it in under 2ms without network access, use CloudFront Functions. For everything else, Lambda@Edge.

Gotcha: Lambda@Edge functions must be deployed in us-east-1. They're automatically replicated to edge locations. If you deploy to eu-west-1, it just won't work as a CloudFront trigger, and the error message isn't great.


Part 9: Observability — Seeing Inside the Black Box

Lambda has no SSH. You can't htop your way to a diagnosis. Your only eyes are logs, metrics, and traces.

Structured logging (stop using print)

# Bad: unstructured, unsearchable
print(f"Processing order {order_id}")

# Good: structured JSON, queryable in CloudWatch Insights
logger.info("Processing order", extra={
    "order_id": order_id,
    "customer_id": customer_id,
    "amount": amount
})
# Output: {"level":"INFO","message":"Processing order","order_id":"ord-123",...}

CloudWatch Insights queries you'll actually use

# Find the slowest invocations
aws logs start-query \
  --log-group-name "/aws/lambda/order-processor" \
  --start-time "$(date -u -d '24 hours ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @type = "REPORT"
    | fields @maxMemoryUsed / 1024 / 1024 as memMB
    | sort @duration desc
    | limit 20
    | display @requestId, @duration, @initDuration, memMB
  '

# Memory utilization — are you over-provisioned or about to OOM?
aws logs start-query \
  --log-group-name "/aws/lambda/order-processor" \
  --start-time "$(date -u -d '7 days ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @type = "REPORT"
    | stats avg(@maxMemoryUsed) / 1024 / 1024 as avgMemMB,
            max(@maxMemoryUsed) / 1024 / 1024 as maxMemMB,
            avg(@memorySize) / 1024 / 1024 as configuredMemMB
    by bin(1d) as day
    | sort day desc
  '

# Find timeout patterns by hour
aws logs start-query \
  --log-group-name "/aws/lambda/order-processor" \
  --start-time "$(date -u -d '7 days ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @message like /Task timed out/
    | stats count() as timeouts by bin(1h) as hour
    | sort hour desc
  '

X-Ray: the distributed trace

X-Ray shows you where time goes inside the function — not just total duration, but "250ms in DynamoDB PutItem, 80ms in SQS SendMessage, 15ms in your code."

# Enable X-Ray
aws lambda update-function-configuration \
  --function-name order-processor \
  --tracing-config Mode=Active

# Find traces with cold starts
aws xray get-trace-summaries \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --filter-expression 'service("order-processor") AND annotation.ColdStart == true'

Part 10: The Serverless vs Containers Debate

This comes up in every architecture review and every interview. Here's the honest version.

| Factor | Lambda | Containers (ECS/Fargate/EKS) |
|---|---|---|
| Cold start | 100ms–10s (runtime dependent) | 0 (always running) |
| Max duration | 15 minutes | Unlimited |
| Scaling | Automatic, per-request | Auto-scaling policies, you configure |
| Cost at low traffic | Nearly free (pay per invocation) | Minimum cost for running tasks |
| Cost at high traffic | Expensive (per-ms billing adds up) | Cheaper (amortized over long-running processes) |
| Operational burden | Minimal (no patching, no capacity planning) | Moderate (container images, task definitions, clusters) |
| Local dev experience | Awkward (SAM local, Docker Lambda) | Native (docker-compose, same image locally and prod) |
| State | Stateless between invocations | Can be stateful |
| Ecosystem | AWS-specific (vendor lock-in) | Portable (containers run anywhere) |

Mental Model: Lambda is a taxi. You pay per ride, it's available instantly, and you don't worry about parking or maintenance. Containers are your own car. Higher fixed cost, but per-mile it's cheaper for your daily commute. If you drive twice a month, get a taxi. If you drive every day, buy the car.

The crossover point: there's no universal number. A lightweight 256 MB function running 200ms costs roughly $1 per million invocations, so against a $200/month container task the break-even sits in the hundreds of millions of invocations per month; a heavier function (more memory, longer duration) crosses over far sooner. Run the numbers for your own workload — the answer varies wildly by memory, duration, and traffic pattern.

Interview Bridge: "When would you choose Lambda over containers?" is a common interview question. Strong answers mention: event-driven workloads with bursty traffic, glue code between AWS services, scheduled tasks that run briefly, and APIs with unpredictable traffic patterns. Weak answers say "always use serverless" or "Lambda can't do anything serious."


Part 11: The Cost Model — Know What You're Paying For

Lambda billing has three components:

Total cost = (invocations x $0.20/million)
           + (GB-seconds x $0.0000166667)
           + (provisioned concurrency x $0.0000041667/GB-second)

GB-seconds = (memory in GB) x (billed duration in seconds)
Billed duration = ceiling(actual duration to nearest 1ms)

Example: A function with 256 MB memory running for 200ms:

  • Memory: 0.25 GB
  • Duration: 0.2 seconds
  • GB-seconds: 0.25 x 0.2 = 0.05
  • Compute cost: 0.05 x $0.0000166667 = $0.00000083
  • Request cost: $0.0000002
  • Total per invocation: ~$0.000001 (one millionth of a dollar)

At 10 million invocations/month: ~$10/month. That's why Lambda is compelling for low-to-moderate traffic.

But at 1 billion invocations/month, the same function costs ~$1,000/month — and a Fargate task running 24/7 that can handle the same throughput might cost $200/month.
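Those figures fall straight out of the billing formula. A sketch you can reuse for your own traffic — the 256 MB / 200ms function and the $200/month container baseline are this document's examples, not universal numbers:

```python
REQUEST_PRICE = 0.20 / 1_000_000   # $ per invocation
GB_SECOND_PRICE = 0.0000166667     # $ per GB-second

def lambda_monthly_cost(invocations, memory_mb=256, duration_ms=200):
    """Monthly on-demand Lambda cost for a given invocation volume."""
    gb_seconds = invocations * (memory_mb / 1024) * (duration_ms / 1000)
    return invocations * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

def breakeven_invocations(container_monthly_cost=200.0, **fn):
    """Smallest monthly volume (to the nearest million) where Lambda
    costs more than a fixed-price container running the same workload."""
    per_invocation = lambda_monthly_cost(1, **fn)
    millions = int(container_monthly_cost / (per_invocation * 1_000_000)) + 1
    return millions * 1_000_000
```

For this function, the break-even against the $200/month task lands near 194 million invocations/month — which is why the "expensive at high traffic" row in the comparison table only bites at serious volume.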

Trivia: Lambda's free tier never expires: 1 million requests and 400,000 GB-seconds per month, forever. A lightweight function checking a health endpoint every minute uses about 43,800 invocations/month — well within the free tier.

The power tuning trick

The AWS Lambda Power Tuning tool (open source, runs as a Step Functions state machine) tests your function at different memory settings and shows you the cost/performance curve:

# Deploy and invoke the power tuning tool
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:order-processor",
    "powerValues": [128, 256, 512, 1024, 1769, 3008],
    "num": 50,
    "payload": "{\"Records\": [{\"body\": \"{\\\"id\\\": \\\"test-001\\\"}\"}]}",
    "parallelInvocation": true
  }'

Common result: a function at 128 MB taking 3,000ms costs more than the same function at 512 MB taking 800ms, because 0.125 * 3 = 0.375 GB-s vs 0.5 * 0.8 = 0.4 GB-s — nearly the same cost but 4x better latency.


Flashcard Check #3

| Question | Answer |
|---|---|
| What are the three Lambda invocation models? | Synchronous (caller waits), asynchronous (Lambda queues, retries twice), event source mapping (Lambda polls) |
| What happens to an async Lambda event that fails all retries with no DLQ configured? | It's silently dropped — no error, no record, no alert |
| Why is the S3 → Lambda → same S3 pattern dangerous? | It creates recursive invocations (infinite loop) that scale exponentially |
| What does Lambda Power Tuning show you? | The cost and performance curve at different memory settings, so you can find the optimal configuration |
| Where are Lambda Layer contents extracted to? | /opt/ in the execution environment |

Part 12: Error Handling and Dead Letter Queues

When async invocations fail, you need somewhere for the bodies to go.

DLQ (Dead Letter Queue): The original approach. Failed events go to an SQS queue or SNS topic after all retries are exhausted.

Destinations: The newer, better approach. You can route both successful and failed invocations to different targets, and the destination receives more context (request payload, error details, stack trace).

# Configure destinations (preferred over DLQ)
aws lambda put-function-event-invoke-config \
  --function-name order-processor \
  --destination-config '{
    "OnSuccess": {"Destination": "arn:aws:sqs:us-east-1:123456789012:order-success"},
    "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:order-failures"}
  }' \
  --maximum-retry-attempts 2 \
  --maximum-event-age-in-seconds 3600

# Monitor the failure queue
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/order-failures \
  --attribute-names ApproximateNumberOfMessages

# Set an alarm so failures don't go unnoticed
aws cloudwatch put-metric-alarm \
  --alarm-name "order-processor-dlq-depth" \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessages \
  --dimensions Name=QueueName,Value=order-failures \
  --statistic Maximum --period 60 --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

Gotcha: DLQs and destinations serve different invocation types. DLQs work with async invocations. For event source mappings (SQS, Kinesis), the dead-letter handling is configured on the source (the SQS queue's RedrivePolicy), not on the Lambda function. Mixing these up is one of the most common Lambda configuration errors — your DLQ config is correct but events never arrive because they're coming from an event source mapping, not an async invocation.


Cheat Sheet

| What | Command / Value | Notes |
|------|-----------------|-------|
| Check cold start rate | CloudWatch Insights: `filter @type = "REPORT" \| stats count(@initDuration) as coldStarts` | Init Duration present = cold start |
| Check concurrency | `aws lambda get-function-concurrency --function-name X` | Empty = unreserved (shared pool) |
| Set reserved concurrency | `aws lambda put-function-concurrency --function-name X --reserved-concurrent-executions N` | Guarantees AND caps |
| Set provisioned concurrency | `aws lambda put-provisioned-concurrency-config --function-name X --qualifier ALIAS --provisioned-concurrent-executions N` | Eliminates cold starts; costs money while idle |
| Enable X-Ray | `aws lambda update-function-configuration --function-name X --tracing-config Mode=Active` | See time inside each SDK call |
| Check VPC config | `aws lambda get-function-configuration --function-name X --query VpcConfig` | Empty = no VPC (simpler) |
| 1 vCPU memory | 1,769 MB | Below this = fractional CPU |
| Max timeout | 900 seconds (15 minutes) | API Gateway hard limit: 29 seconds |
| Max deployment package | 50 MB (zipped), 250 MB (unzipped) | Container images: up to 10 GB |
| Max layers | 5 per function | Extracted to `/opt/` |
| Default concurrency | 1,000 per account per region | Shared across all functions |
| Free tier | 1M requests + 400K GB-seconds/month | Never expires |
| Async retries | 2 (default) | Then DLQ/destination, or silently dropped |
| SQS visibility timeout | >= 6x Lambda timeout | AWS recommendation |
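That last rule of thumb is worth automating so nobody has to remember it. A minimal sketch (the helper function name is my own) that derives the recommended visibility timeout from a function's timeout:

```shell
# AWS recommends SQS visibility timeout >= 6x the Lambda timeout, so a message
# isn't redelivered to a second worker while a slow invocation is still running.
recommended_visibility() {
  awk -v t="$1" 'BEGIN { print t * 6 }'
}

# In practice you'd feed it the live value:
#   timeout=$(aws lambda get-function-configuration \
#     --function-name order-processor --query Timeout --output text)
recommended_visibility 30    # prints 180
```

A 900-second (max timeout) function would want a 5,400-second visibility timeout, which is under the SQS maximum of 12 hours, so the rule never conflicts with SQS limits.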

Exercises

Exercise 1: Find the cold start rate (5 minutes)

Pick a Lambda function in your account (or use the AWS console to create a simple one). Run this CloudWatch Insights query:

filter @type = "REPORT"
| stats count(*) as total,
        count(@initDuration) as coldStarts,
        (count(@initDuration) / count(*)) * 100 as coldStartPct

What's the cold start percentage? Is it above or below 1%?

What you should see: For functions with steady traffic, a cold start rate below 1% is normal. If you see 10%+, check whether traffic is bursty (low at night, spike in the morning) or whether something is forcing environment recycling (frequent deploys, code changes, memory updates).
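If you'd rather skip the console, the same query can be launched from the CLI. A sketch; the log group name and one-hour window are assumptions:

```shell
# Insights queries run asynchronously: start the query, then poll for results.
query_id=$(aws logs start-query \
  --log-group-name /aws/lambda/order-processor \
  --start-time $(($(date +%s) - 3600)) \
  --end-time $(date +%s) \
  --query-string 'filter @type = "REPORT" | stats count(*) as total, count(@initDuration) as coldStarts' \
  --query queryId --output text)

# Give the query a moment to complete before fetching results
sleep 5
aws logs get-query-results --query-id "$query_id"
```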

Exercise 2: Right-size memory (15 minutes)

Find a function that runs at 128 MB. Check its actual memory usage:

filter @type = "REPORT"
| stats avg(@maxMemoryUsed / 1024 / 1024) as avgMB,
        max(@maxMemoryUsed / 1024 / 1024) as maxMB,
        avg(@duration) as avgMs

Then increase memory to 256 MB and re-run the same query after a few invocations. Did duration decrease? Did the total cost (duration x memory) change?

Hint: Calculate GB-seconds: `(memory_MB / 1024) * (duration_ms / 1000)`. Compare this value at 128 MB and at 256 MB. If GB-seconds decreased (or stayed flat while duration halved), you found a win.
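The hint's arithmetic, as a runnable sketch. The durations (820 ms at 128 MB, 390 ms at 256 MB) are made-up illustrative numbers, not measurements:

```shell
# GB-seconds = (memory in GB) x (duration in seconds) -- the unit you pay for
gb_seconds() {
  awk -v m="$1" -v d="$2" 'BEGIN { printf "%.4f\n", (m / 1024) * (d / 1000) }'
}

gb_seconds 128 820   # prints 0.1025
gb_seconds 256 390   # prints 0.0975 -- double the memory, LOWER total cost
```

This is the counterintuitive result the exercise is after: if doubling memory more than halves duration (because the function was CPU-starved), the bill goes down.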

Exercise 3: Build a safe event pipeline (20 minutes)

Design (on paper or in a SAM template) an image processing pipeline that avoids the recursive trigger problem:

  • Source bucket: uploads-bucket
  • Lambda: resize images to thumbnails
  • Destination: where do thumbnails go?

What event filter would you use? What concurrency limit would you set? What alarms?

Solution outline
  • Separate destination bucket: `thumbnails-bucket` (never the same as the source)
  • S3 event filter: prefix `uploads/` plus suffix rules for `.jpg` and `.png` (S3 allows one suffix rule per notification configuration, so use two configurations)
  • Reserved concurrency: 50 (circuit breaker)
  • CloudWatch alarm on concurrent executions > 40 (early warning)
  • Billing alarm at $50/day
  • DLQ for failed processing so images aren't lost
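Wired up with the CLI, the outline's two key guardrails look roughly like this. The function name `thumbnailer` and the account/region in the ARN are placeholders:

```shell
# Scope the trigger: only .jpg objects under uploads/ in the SOURCE bucket
# fire the function. The function writes to thumbnails-bucket, never back here.
aws s3api put-bucket-notification-configuration \
  --bucket uploads-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:thumbnailer",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {"Key": {"FilterRules": [
        {"Name": "prefix", "Value": "uploads/"},
        {"Name": "suffix", "Value": ".jpg"}
      ]}}
    }]
  }'

# Circuit breaker: even a runaway trigger loop can never fan out
# past 50 concurrent copies of this function.
aws lambda put-function-concurrency \
  --function-name thumbnailer \
  --reserved-concurrent-executions 50
```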

Takeaways

  • Cold starts are rarely the real problem. Measure before assuming. CloudWatch Insights count(@initDuration) tells you in 10 seconds.
  • VPC Lambda is no longer the boogeyman. Post-2019 Hyperplane eliminated the ENI cold start penalty. The real VPC gotcha is forgetting the NAT Gateway for internet access.
  • 1,769 MB = 1 vCPU. Below that, your function gets fractional CPU. More memory often means less cost because shorter duration offsets the higher per-ms price.
  • Never write to the same resource that triggers you. Use separate source and destination buckets/tables/queues. Set reserved concurrency as a circuit breaker.
  • Partial batch failure reporting is not on by default. Without it, one bad SQS message poisons the whole batch and causes duplicate processing.
  • Lambda timeouts usually point downstream. The function is the messenger, not the murderer. X-Ray shows you where the time actually goes.