AWS Lambda Footguns¶
Mistakes that cause outages, data loss, runaway costs, or silent failures in AWS Lambda.
1. VPC Lambda without NAT (cannot reach the internet)¶
You attach your Lambda function to a VPC so it can access an RDS database in a private subnet. The function works for database queries but fails when calling an external API or AWS services. You get timeout errors. The function has no internet access because the VPC's private subnets do not have a NAT Gateway, and Lambda in a VPC does not get internet access automatically.
Fix: If your Lambda needs both VPC resources and internet access, route through a NAT Gateway. If it only needs AWS services (S3, DynamoDB, SQS), use VPC endpoints instead — they are cheaper and do not require NAT. If the function does not need VPC access at all (no RDS, no ElastiCache, no internal APIs), remove the VPC configuration entirely for simpler networking and faster cold starts.
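The VPC-endpoint route can be sketched with the CLI. This is an illustrative fragment, not a complete setup: the VPC ID, route table ID, and region are placeholders, and a gateway endpoint only covers S3 and DynamoDB (other services need interface endpoints).

```shell
# Sketch: gateway VPC endpoint for S3 — no NAT needed, no hourly charge.
# vpc-0abc1234 and rtb-0abc1234 are placeholder IDs; adjust the region
# in the service name to match your VPC.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-0abc1234
```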
2. Timeout too close to API Gateway limit (29 seconds)¶
API Gateway has a hard 29-second timeout for Lambda integrations. You set your Lambda timeout to 30 seconds. When the function takes 29.5 seconds, API Gateway returns 504 to the client, but the Lambda function keeps running for another 0.5 seconds (completing its work but the response is lost). The client retries, causing duplicate processing.
Fix: Set Lambda timeout to 25 seconds when behind API Gateway — leave a buffer. Better yet, for long-running operations, return an immediate 202 Accepted with a job ID, process asynchronously, and let the client poll for results. If you need more than 29 seconds of processing, API Gateway is the wrong trigger — use SQS, Step Functions, or direct invocation.
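The 202 Accepted pattern can be sketched as below. `enqueue_job` is a placeholder for whatever hands the work off (an SQS `send_message`, a Step Functions `start_execution`, etc.) — the point is that the synchronous path only acknowledges and returns well inside the 29-second window.

```python
import json
import uuid

def handler(event, context):
    # Acknowledge immediately; do the real work out of band.
    job_id = str(uuid.uuid4())
    enqueue_job(job_id, event)
    return {
        "statusCode": 202,
        "body": json.dumps({"jobId": job_id, "status": "accepted"}),
    }

def enqueue_job(job_id, payload):
    # Placeholder: in a real system, send to SQS or start a
    # Step Functions execution here.
    pass
```

The client then polls a separate `GET /jobs/{jobId}` endpoint (or receives a callback) for the result.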
3. Not handling partial batch failures¶
Your Lambda processes SQS messages in batches of 10. Message 7 fails. Without partial batch failure reporting, Lambda treats the entire batch as failed. All 10 messages go back to the queue. Messages 1-6 and 8-10 are reprocessed — potentially causing duplicate writes, double charges, or duplicate notifications.
```python
# BAD: entire batch fails if any message fails
def handler(event, context):
    for record in event["Records"]:
        process(record)  # if this throws, ALL messages retry
```

```python
# GOOD: report individual failures
def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```
Fix: Enable ReportBatchItemFailures on the event source mapping and return individual failed message IDs. This way, only the failed messages are retried. Without this, your processing must be idempotent (safe to run multiple times) for every message in every batch.
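Enabling this on an existing SQS event source mapping looks roughly like the following (the mapping UUID is a placeholder; find yours with `aws lambda list-event-source-mappings`):

```shell
# Sketch: turn on partial batch responses for an SQS mapping.
aws lambda update-event-source-mapping \
  --uuid 11111111-2222-3333-4444-555555555555 \
  --function-response-types ReportBatchItemFailures
```

Without this flag set, the `batchItemFailures` return value is ignored and the whole batch still retries.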
4. Lambda recursion (function triggers itself)¶
Your Lambda writes to an S3 bucket. That bucket has an S3 event trigger that invokes the same Lambda. Each invocation writes to S3, which triggers another invocation, which writes to S3. This is an infinite loop. Invocations pile up until you hit the account concurrency limit, and you accumulate massive costs in minutes.
The same pattern can occur with: Lambda writing to SQS that triggers the same Lambda, or Lambda writing to DynamoDB Streams that feeds back to the same Lambda.
Fix: Never trigger a Lambda from a resource it writes to. If you must, use distinct source and destination buckets/tables/queues. AWS now has Lambda recursive loop detection (auto-stops after 16 recursive invocations for some patterns), but do not rely on it. Set reserved concurrency on functions to cap runaway scaling. Set up billing alarms as a safety net.
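When source and destination must share a bucket, a prefix guard is one defensive sketch: configure the S3 event notification with a key prefix filter so your own output never fires the trigger, and double-check in code. The `output/` prefix here is an assumption about where the function writes.

```python
def handler(event, context):
    # Defensive guard: skip objects under the prefix this function
    # writes to, so our own output cannot re-trigger us.
    # Returns the keys it actually processed.
    processed = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if key.startswith("output/"):
            continue  # our own output — break the loop
        process(record)
        processed.append(key)
    return processed

def process(record):
    # Placeholder for the real work (which writes under output/).
    pass
```

The in-code check is a backstop; the primary fix is still the prefix/suffix filter on the event notification itself.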
5. Oversized deployment packages¶
Your deployment package is 250 MB unzipped (the maximum for a zip-based function) because it includes pandas, numpy, scipy, and every other dependency. Cold starts take 5-10 seconds because AWS must download and extract the package. Every invocation that creates a new execution environment pays this penalty.
Fix: Strip unnecessary dependencies. Use Lambda layers for shared libraries, or container images (up to 10 GB, with image layers cached to soften cold starts). For Python, use pip install --target . --no-deps to avoid transitive bloat. Exclude test files, docs, and compiled objects. Do not bundle boto3 — it is already included in the Python runtime.
```shell
# Check deployment package size
aws lambda get-function --function-name my-function \
  --query 'Configuration.CodeSize'
# Returns bytes — divide by 1048576 for MB
```
6. Environment variables with secrets¶
You store database passwords, API keys, and tokens in Lambda environment variables. These are visible to anyone with lambda:GetFunctionConfiguration permission. They show up in the console, in CloudFormation templates, in Terraform state files, and in any logging that dumps the environment.
```shell
# Anyone with this permission can see all your secrets:
aws lambda get-function-configuration --function-name my-function \
  --query 'Environment.Variables'
```
Fix: Use AWS Secrets Manager or SSM Parameter Store for secrets. Cache the secret value in memory (outside the handler) to avoid API calls on every invocation. Use the AWS Parameters and Secrets Lambda Extension for automatic caching:
```python
# Fetch secret once per cold start, not per invocation
import json

import boto3

secrets = boto3.client("secretsmanager")
_db_creds = None

def get_db_creds():
    global _db_creds
    if _db_creds is None:
        resp = secrets.get_secret_value(SecretId="prod/db")
        _db_creds = json.loads(resp["SecretString"])
    return _db_creds
```
7. Cold starts in latency-sensitive paths¶
Your user-facing API endpoint is backed by a Lambda function. During low-traffic periods (nights, weekends), execution environments are recycled. The first request after a quiet period hits a cold start — 500ms to 3 seconds of extra latency. Users see a spinner. Synthetic monitoring fires an alert.
Fix: For latency-sensitive paths, use provisioned concurrency to pre-warm execution environments. This costs money (you pay for the pre-warmed slots whether used or not), so use it only for critical paths. Alternatively, use a scheduled EventBridge rule (formerly CloudWatch Events) to "warm" the function every 5 minutes — but this is fragile and does not scale. For consistently low-latency needs, consider whether Lambda is the right choice at all — ECS/Fargate with pre-warmed containers may be better.
```shell
# Set provisioned concurrency on a specific version/alias
aws lambda put-provisioned-concurrency-config \
  --function-name my-function \
  --qualifier prod \
  --provisioned-concurrent-executions 10
```
8. Provisioned concurrency costs¶
You set provisioned concurrency to 100 "to be safe." Each unit costs roughly $0.015/hour (varies by memory size). That is $1.50/hour, $1,080/month — for capacity that may sit idle. If your function only gets 10 concurrent invocations during peak, you are paying for 90 unused warm slots 24/7.
Fix: Use Application Auto Scaling with provisioned concurrency to scale based on actual utilization. Set a schedule: higher provisioned concurrency during business hours, lower at night. Monitor the ProvisionedConcurrencyUtilization metric — if it is consistently below 50%, you are over-provisioned.
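Target-tracking on utilization can be sketched like this. The function name, alias, and capacity bounds are placeholders; the target of 0.7 means Auto Scaling adds capacity when utilization exceeds roughly 70%.

```shell
# Sketch: scale provisioned concurrency on actual utilization.
aws application-autoscaling register-scalable-target \
  --service-namespace lambda \
  --resource-id function:my-function:prod \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --min-capacity 2 --max-capacity 100

aws application-autoscaling put-scaling-policy \
  --service-namespace lambda \
  --resource-id function:my-function:prod \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --policy-name pc-utilization \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 0.7,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
    }
  }'
```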
9. Not setting reserved concurrency (noisy neighbor)¶
You deploy 20 Lambda functions in an account with the default 1,000 concurrent execution limit. One function has a traffic spike and consumes 950 concurrent executions. The other 19 functions are throttled. Your payment processing function returns 429 errors because a batch analytics function ate all the concurrency.
Fix: Set reserved concurrency on critical functions. A function with reserved concurrency of 50 is guaranteed 50 concurrent slots, even if other functions are spiking. This also caps the function at 50 concurrent executions, preventing it from being the noisy neighbor. Balance reserved and unreserved concurrency across your functions.
```shell
# Reserve concurrency for critical functions
aws lambda put-function-concurrency \
  --function-name payment-processor \
  --reserved-concurrent-executions 100
aws lambda put-function-concurrency \
  --function-name notifications \
  --reserved-concurrent-executions 50
# Remaining 850 is shared by all other functions
```
10. Forgetting to increase memory (also increases CPU)¶
Your function is CPU-bound (compression, JSON parsing, cryptographic operations) and takes 3 seconds at 128 MB. You assume Lambda does not let you control CPU. In fact, Lambda allocates CPU proportional to memory: at 1,769 MB you get one full vCPU. Increasing memory to 512 MB could cut execution time to under 1 second at similar or even lower total cost, because the shorter duration offsets the higher per-ms price.
Fix: Use the AWS Lambda Power Tuning tool to find the optimal memory setting. Test your function at 128, 256, 512, 1024, and 2048 MB. Plot cost vs duration. Often the sweet spot is 512-1024 MB for typical workloads. Do not leave all functions at 128 MB — it is almost never the cheapest option.
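The memory/duration trade-off can be sketched with back-of-envelope arithmetic. The price constant below is an assumed x86 figure (check current Lambda pricing), and the "4x memory gives roughly 4x speedup" scenario assumes a fully CPU-bound workload.

```python
# Assumed approximate x86 price; verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667

def cost_per_invocation(memory_mb, duration_ms):
    # Lambda bills GB-seconds: memory (in GB) times billed duration.
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND

# 3000 ms at 128 MB vs ~750 ms at 512 MB (4x CPU, roughly 4x faster):
# identical GB-seconds, so identical cost — at a quarter of the latency.
slow = cost_per_invocation(128, 3000)
fast = cost_per_invocation(512, 750)
```

Even when the speedup is sublinear and the higher setting costs slightly more, the latency win often justifies it; Power Tuning makes this trade-off visible per function.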
11. Synchronous invocations without proper error handling¶
Your API Gateway triggers Lambda synchronously. The function throws an unhandled exception. With a Lambda proxy integration, API Gateway returns a generic 502 with no useful detail; with a non-proxy (custom) integration and no error response mapping, the error payload can be forwarded to the client with a 200 status. Either way the client cannot act on the failure, and in the 200 case it believes the request succeeded. Data is silently lost.
Fix: Always wrap your handler in try/except and return proper HTTP status codes:
```python
import json
import logging

logger = logging.getLogger()

def handler(event, context):
    try:
        result = process(event)
        return {"statusCode": 200, "body": json.dumps(result)}
    except ValidationError as e:
        return {"statusCode": 400, "body": json.dumps({"error": str(e)})}
    except Exception:
        logger.exception("Unhandled error")
        return {"statusCode": 500, "body": json.dumps({"error": "Internal error"})}
```
12. /tmp storage filling up across warm invocations¶
Lambda provides /tmp for temporary file storage (512 MB default, up to 10 GB). Because execution environments are reused across warm invocations, files written to /tmp accumulate. If your function writes temporary files and does not clean them up, /tmp fills up and subsequent invocations fail with "No space left on device."
Fix: Always clean up /tmp files in your handler (use try/finally or context managers). Or write to unique paths per invocation and clean up at the start of each invocation:
```python
import glob
import os
import shutil

def handler(event, context):
    # Clean up old temp files at the start
    for f in glob.glob("/tmp/work-*"):
        if os.path.isdir(f):
            shutil.rmtree(f, ignore_errors=True)
        else:
            os.remove(f)
    # Use a unique path for this invocation
    work_dir = f"/tmp/work-{context.aws_request_id}"
    os.makedirs(work_dir, exist_ok=True)
    try:
        # ... do work ...
        pass
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)
```
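If cleanup alone is not enough because a single invocation legitimately needs more than 512 MB, the ephemeral storage size can be raised. A sketch (sizes above 512 MB are billed extra, up to a 10,240 MB maximum):

```shell
# Sketch: raise /tmp from the 512 MB default to 2 GB.
aws lambda update-function-configuration \
  --function-name my-function \
  --ephemeral-storage '{"Size": 2048}'
```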