AWS Lambda - Street-Level Ops

Real-world Lambda debugging and operational workflows. These are the procedures you reach for when functions are slow, failing, or costing more than expected.

Debugging Cold Starts

Cold starts are the most common Lambda performance complaint. Know how to measure and mitigate them.

# Find cold starts in CloudWatch Logs using Insights
aws logs start-query \
  --log-group-name "/aws/lambda/my-function" \
  --start-time "$(date -u -d '1 hour ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @type = "REPORT"
    | stats count() as invocations,
            count(@initDuration) as coldStarts,
            avg(@initDuration) as avgInitMs,
            max(@initDuration) as maxInitMs,
            avg(@duration) as avgDurationMs
    | display invocations, coldStarts, avgInitMs, maxInitMs, avgDurationMs'

QUERY_ID=$(aws logs start-query ... --output text)
# Poll until the query finishes (status moves from Running to Complete)
until aws logs get-query-results --query-id "$QUERY_ID" --query 'status' --output text | grep -q Complete; do sleep 2; done
aws logs get-query-results --query-id "$QUERY_ID"

X-Ray gives you the full breakdown — initialization vs invocation time:

# Enable X-Ray tracing
aws lambda update-function-configuration \
  --function-name my-function \
  --tracing-config Mode=Active

# Query X-Ray for cold start traces
# (ColdStart is not recorded automatically: your function must add the
# annotation, e.g. via AWS Lambda Powertools Tracer)
aws xray get-trace-summaries \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --filter-expression 'annotation.ColdStart = true'

Cold start mitigation strategies:

  1. Reduce package size — smaller deployment = faster cold start. Use Lambda layers for shared deps.
  2. Avoid VPC unless required — VPC Lambda cold starts add ENI creation time (improved since Hyperplane but still measurable).
  3. Use provisioned concurrency for latency-sensitive paths.
  4. Initialize outside the handler — SDK clients, DB connections, config loading go in module scope, not handler scope.
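Strategy 4 is the cheapest win. A minimal Python sketch of the pattern (hypothetical handler, no AWS calls, so it runs anywhere): module-scope code executes once per cold start, handler-scope code on every invocation of the same warm execution environment.

```python
INIT_COUNT = 0  # instrumentation so the behavior is visible

def expensive_init():
    """Stand-in for SDK client creation, DB connect, config load."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"client": "ready"}

# Module scope: runs once per cold start, reused by every warm invocation.
CLIENT = expensive_init()

def handler(event, context):
    # Handler scope: runs on every invocation; reuses CLIENT instead of
    # paying the init cost again.
    return {"status": 200, "client": CLIENT["client"]}
```

Calling the handler repeatedly leaves INIT_COUNT at 1, which is exactly the behavior you want across warm invocations.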

CloudWatch Logs Analysis

The REPORT line at the end of every invocation is your primary telemetry source.

# Parse REPORT lines for duration, memory, and billing
aws logs start-query \
  --log-group-name "/aws/lambda/my-function" \
  --start-time "$(date -u -d '24 hours ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @type = "REPORT"
    | stats avg(@duration) as avgMs,
            max(@duration) as maxMs,
            pct(@duration, 95) as p95Ms,
            pct(@duration, 99) as p99Ms,
            avg(@maxMemoryUsed / 1024 / 1024) as avgMemMB,
            max(@maxMemoryUsed / 1024 / 1024) as maxMemMB,
            avg(@billedDuration) as avgBilledMs
    by bin(1h) as hour
    | sort hour desc'
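If you export logs and analyze them offline instead, REPORT lines parse cleanly with a regex. A sketch in Python; the field layout follows the standard REPORT format, and the Init Duration field appears only on cold starts (sample values below are hypothetical):

```python
import re

# Matches the fixed field order of a Lambda REPORT log line.
REPORT_RE = re.compile(
    r"Duration: (?P<duration>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed>[\d.]+) ms\s+"
    r"Memory Size: (?P<mem>\d+) MB\s+"
    r"Max Memory Used: (?P<used>\d+) MB"
    r"(?:\s+Init Duration: (?P<init>[\d.]+) ms)?"  # cold starts only
)

def parse_report(line):
    """Return the REPORT fields as floats, or None for non-REPORT lines."""
    m = REPORT_RE.search(line)
    if m is None:
        return None
    fields = {k: float(v) for k, v in m.groupdict().items() if v is not None}
    fields["cold_start"] = "init" in fields
    return fields
```

This is handy for ad-hoc analysis (histograms, per-request billing) when a Logs Insights query is overkill.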

# Find functions that timed out
aws logs start-query \
  --log-group-name "/aws/lambda/my-function" \
  --start-time "$(date -u -d '24 hours ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @message like /Task timed out/
    | stats count() as timeouts by bin(1h) as hour
    | sort hour desc'

# Find specific error patterns
aws logs start-query \
  --log-group-name "/aws/lambda/my-function" \
  --start-time "$(date -u -d '1 hour ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @message like /Error|Exception|Traceback/
    | fields @timestamp, @message
    | sort @timestamp desc
    | limit 50'

Timeout Debugging

When a function times out, it is either doing too much work or waiting on something that is not responding.

# Check current timeout setting
aws lambda get-function-configuration --function-name my-function \
  --query '{Timeout:Timeout,MemorySize:MemorySize,Runtime:Runtime}'

# If the function calls other services, X-Ray shows where time is spent:
# - DynamoDB queries taking too long → check table capacity
# - S3 GetObject slow → large objects or throttling
# - External API timeout → downstream is the bottleneck
# - ENI creation → VPC cold start

# Check for throttling that might cause retries and cascading timeouts
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name Throttles \
  --dimensions Name=FunctionName,Value=my-function \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 60 --statistics Sum

Key timeout rules:

Default trap: Lambda's default timeout is 3 seconds. New functions often fail with "Task timed out after 3.00 seconds" on the very first cold-start invocation because initialization alone (SDK imports, DB connections) takes longer than 3s. Always set an explicit timeout based on measured execution time plus cold-start overhead.

  • API Gateway has a hard 29-second integration timeout. If your Lambda timeout is 30s, API Gateway will time out before Lambda finishes. Set Lambda timeout to 25s max when behind API Gateway.
  • SQS visibility timeout must be at least 6x the Lambda timeout (AWS recommendation).
  • Step Functions have their own timeout separate from Lambda.
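These rules are easy to encode. A toy Python helper that folds them together (the 25s cap and the 6x multiplier are the rules of thumb above, not AWS-enforced constants; the function name and parameters are made up for illustration):

```python
API_GATEWAY_LIMIT_S = 29  # hard API Gateway integration timeout

def recommended_settings(measured_p99_s, cold_start_s, behind_api_gateway=False):
    """Suggest a Lambda timeout and matching SQS visibility timeout."""
    # Budget: measured worst case plus cold-start overhead, truncated,
    # plus 1 s of headroom.
    timeout_s = int(measured_p99_s + cold_start_s) + 1
    if behind_api_gateway:
        # Stay comfortably under the 29 s API Gateway limit.
        timeout_s = min(timeout_s, 25)
    return {
        "lambda_timeout_s": timeout_s,
        # AWS recommendation: visibility timeout >= 6x the function timeout.
        "sqs_visibility_timeout_s": 6 * timeout_s,
    }
```

For example, a function with a 2s p99 and 1.5s cold start gets a 4s timeout and a 24s SQS visibility timeout, instead of silently inheriting the 3s default.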

VPC Lambda Cold Start Mitigation

VPC Lambda functions used to add 10+ seconds of cold start for ENI creation. Hyperplane (2019) reduced this dramatically, but VPC Lambda still adds measurable latency.

# Check if a function is in a VPC
aws lambda get-function-configuration --function-name my-function \
  --query 'VpcConfig.{SubnetIds:SubnetIds,SecurityGroupIds:SecurityGroupIds,VpcId:VpcId}'

# If VPC config is populated, check that subnets have enough IPs
for subnet in $(aws lambda get-function-configuration --function-name my-function \
  --query 'VpcConfig.SubnetIds[]' --output text); do
  aws ec2 describe-subnets --subnet-ids "$subnet" \
    --query 'Subnets[0].{SubnetId:SubnetId,AZ:AvailabilityZone,AvailableIPs:AvailableIpAddressCount,CIDR:CidrBlock}'
done

# VPC Lambda needs outbound access to AWS services
# Option 1: NAT Gateway (costs money, supports all services)
# Option 2: VPC Endpoints (free for S3/DynamoDB gateway endpoints)
# Check existing endpoints:
aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=vpc-abc123" \
  --query 'VpcEndpoints[].{Service:ServiceName,Type:VpcEndpointType,State:State}'

Concurrency Throttling Diagnosis

When your function is throttled, invocations are rejected or retried depending on the trigger.

# Check throttling metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name Throttles \
  --dimensions Name=FunctionName,Value=my-function \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 60 --statistics Sum

# Check concurrent executions
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name ConcurrentExecutions \
  --dimensions Name=FunctionName,Value=my-function \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 60 --statistics Maximum

# Check account-level concurrency limit
aws lambda get-account-settings \
  --query '{ConcurrentExecutions:AccountLimit.ConcurrentExecutions,UnreservedConcurrency:AccountLimit.UnreservedConcurrentExecutions}'

# Check function-level reserved concurrency
aws lambda get-function-concurrency --function-name my-function

# Set reserved concurrency to guarantee capacity (and cap it)
aws lambda put-function-concurrency --function-name my-function \
  --reserved-concurrent-executions 100

Throttling behavior by trigger:

  • API Gateway: returns 429 to the caller immediately
  • SQS: message becomes visible again after the visibility timeout and is retried, until maxReceiveCount sends it to the DLQ (if configured)
  • Kinesis/DynamoDB Streams: entire shard is blocked until the batch succeeds
  • SNS: retries with backoff for up to 6 hours
  • EventBridge: retries with backoff for up to 24 hours

DLQ Monitoring Pattern

Dead letter queues catch failed invocations. If you are not monitoring them, failures disappear silently.

# Check DLQ configuration
aws lambda get-function-configuration --function-name my-function \
  --query 'DeadLetterConfig'

# If using SQS as DLQ, check message count
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-function-dlq \
  --attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible

# Set up a CloudWatch alarm on DLQ depth
aws cloudwatch put-metric-alarm \
  --alarm-name "my-function-dlq-depth" \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessages \
  --dimensions Name=QueueName,Value=my-function-dlq \
  --statistic Maximum --period 60 --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

For asynchronous invocations, prefer on-failure destinations over DLQs: they capture more context (the request payload plus error details and stack trace).

Cost Optimization

Lambda billing is based on invocations, duration, and memory. Memory also determines CPU allocation — more memory = more CPU = faster execution = sometimes cheaper.

# Analyze cost drivers: duration * memory
aws logs start-query \
  --log-group-name "/aws/lambda/my-function" \
  --start-time "$(date -u -d '7 days ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter @type = "REPORT"
    | stats count() as invocations,
            sum(@billedDuration) as totalBilledMs,
            avg(@billedDuration) as avgBilledMs,
            avg(@maxMemoryUsed / 1024 / 1024) as avgMemUsedMB
    by bin(1d) as day
    | sort day desc'

Use the AWS Lambda Power Tuning tool to find the optimal memory setting. It runs your function at different memory configurations and plots cost vs duration:

# Deploy power tuning (Step Functions state machine)
# https://github.com/alexcasalboni/aws-lambda-power-tuning
# Then invoke with:
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:my-function",
    "powerValues": [128, 256, 512, 1024, 2048, 3072],
    "num": 50,
    "payload": "{\"test\": true}",
    "parallelInvocation": true
  }'

Common findings: a function at 128MB taking 3000ms might take 500ms at 512MB — costing less because the duration reduction outweighs the memory price increase.

Under the hood: Lambda allocates CPU proportional to memory. At 1769MB you get one full vCPU. Below that, you get a fraction. A CPU-bound function at 128MB gets ~7% of a vCPU and runs painfully slowly -- bumping memory to 256MB doubles its CPU allocation and often halves execution time at the same cost.
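Back-of-envelope math for both claims, assuming the commonly cited us-east-1 x86 duration price of $0.0000166667 per GB-second (an assumption; check current pricing before relying on it):

```python
PRICE_PER_GB_S = 0.0000166667  # assumed us-east-1 x86 duration price
FULL_VCPU_MB = 1769            # memory at which Lambda allocates one full vCPU

def vcpu_fraction(memory_mb):
    """Approximate share of a vCPU allocated at this memory setting."""
    return memory_mb / FULL_VCPU_MB

def duration_cost_usd(memory_mb, duration_ms):
    """Duration cost of one invocation: GB-seconds times the GB-s price."""
    return (memory_mb / 1024) * (duration_ms / 1000) * PRICE_PER_GB_S

# The example from the text: 128 MB at 3000 ms (0.375 GB-s) vs
# 512 MB at 500 ms (0.25 GB-s). The faster configuration is cheaper.
slow = duration_cost_usd(128, 3000)
fast = duration_cost_usd(512, 500)
```

The 128MB configuration gets about 7% of a vCPU, which is why CPU-bound code at that setting is so slow in the first place.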

Event Source Mapping Debugging

When Lambda is triggered by SQS, Kinesis, or DynamoDB Streams and things go wrong:

# List event source mappings for a function
aws lambda list-event-source-mappings --function-name my-function \
  --query 'EventSourceMappings[].{UUID:UUID,Source:EventSourceArn,State:State,LastProcessingResult:LastProcessingResult,BatchSize:BatchSize}'

# Check for failures in the mapping
aws lambda get-event-source-mapping --uuid abc-123-def \
  --query '{State:State,LastProcessingResult:LastProcessingResult,StateTransitionReason:StateTransitionReason}'

# Common LastProcessingResult values:
# "OK" — working fine
# "No records processed" — empty stream/queue
# "PROBLEM: ..." — function errors, check CloudWatch Logs

# For SQS: check if messages are stuck (high ApproximateAge)
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attribute-names All \
  --query 'Attributes.{Messages:ApproximateNumberOfMessages,NotVisible:ApproximateNumberOfMessagesNotVisible,Delayed:ApproximateNumberOfMessagesDelayed}'

# For Kinesis: check iterator age (how far behind the consumer is)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name IteratorAge \
  --dimensions Name=FunctionName,Value=my-function \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 60 --statistics Maximum
# IteratorAge > 0 and growing = consumer is falling behind
# IteratorAge in hours = seriously behind, probably throttled or erroring
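The "growing" check is just a trend test over the datapoints. A toy Python version (hypothetical sample values; in practice you would feed it the Maximum datapoints from the CloudWatch call above):

```python
def iterator_age_trend(age_ms_samples):
    """Classify a window of IteratorAge samples (oldest first).

    Returns 'falling_behind' if the second half of the window averages
    markedly higher than the first half, else 'keeping_up'.
    """
    if len(age_ms_samples) < 2:
        return "keeping_up"
    mid = len(age_ms_samples) // 2
    first = sum(age_ms_samples[:mid]) / mid
    second = sum(age_ms_samples[mid:]) / (len(age_ms_samples) - mid)
    # 1.5x is an arbitrary illustrative threshold, not an AWS value.
    return "falling_behind" if second > first * 1.5 else "keeping_up"
```

A steadily doubling age (1s, 2s, 4s, 8s) classifies as falling behind; a flat age around 1s does not.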

Lambda Insights

Lambda Insights is an enhanced monitoring extension that provides system-level metrics.

# Enable Lambda Insights (adds the extension layer)
aws lambda update-function-configuration \
  --function-name my-function \
  --layers "arn:aws:lambda:us-east-1:580247275435:layer:LambdaInsightsExtension:38"

# Query Insights data via CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace LambdaInsights \
  --metric-name memory_utilization \
  --dimensions Name=function_name,Value=my-function \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 --statistics Average

# Available metrics: memory_utilization, cpu_total_time, tx_bytes, rx_bytes,
#   tmp_used, tmp_max, fd_use, fd_max, threads, init_duration

Deployment Patterns

Common deployment approaches and their operational implications:

# Check current function configuration and aliases
aws lambda get-function --function-name my-function \
  --query '{Runtime:Configuration.Runtime,Handler:Configuration.Handler,CodeSize:Configuration.CodeSize,LastModified:Configuration.LastModified,Layers:Configuration.Layers[].Arn}'

# List versions and aliases (for traffic shifting / canary deploys)
aws lambda list-versions-by-function --function-name my-function \
  --query 'Versions[-5:].[Version,Description,LastModified]'
aws lambda list-aliases --function-name my-function \
  --query 'Aliases[].{Name:Name,Version:FunctionVersion,Routing:RoutingConfig}'

# Weighted alias for canary deployment (10% to new version)
aws lambda update-alias --function-name my-function --name prod \
  --function-version 42 \
  --routing-config '{"AdditionalVersionWeights":{"43":0.1}}'
# Monitor errors on the new version, then shift 100%:
aws lambda update-alias --function-name my-function --name prod \
  --function-version 43 --routing-config '{"AdditionalVersionWeights":{}}'
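Under the hood, a weighted alias routes each request independently by probability. A toy Python simulation of the 10% split (illustration only, not the actual Lambda routing implementation):

```python
import random

def route(weight_new, rng):
    """Route one invocation: 'new' with probability weight_new."""
    return "new" if rng.random() < weight_new else "stable"

rng = random.Random(42)  # seeded so the run is reproducible
counts = {"new": 0, "stable": 0}
for _ in range(10_000):
    counts[route(0.1, rng)] += 1
# counts["new"] lands near 1000, i.e. roughly 10% of traffic
```

The practical implication: at 10% weight and low traffic, a rare bug in the new version may take many invocations to surface, so give the canary enough time and volume before shifting to 100%.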