AWS CloudWatch Footguns¶
Mistakes that cause missed alerts, surprise bills, and blind spots in production monitoring.
1. CloudWatch Logs costs at scale¶
You enable verbose DEBUG logging for your application fleet. Each instance writes 10GB/day. With 20 instances, that is 200GB/day in log ingestion at $0.50/GB — $100/day, $3,000/month — just for log ingestion. Storage adds another $0.03/GB/month on top. You find out when the monthly bill arrives.
Fix: Log at INFO level in production. Use structured logging with levels so you can filter. Set retention policies on every log group. Audit costs with:
# Find log groups with no retention (storing forever)
aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].{Name:logGroupName,StoredGB:storedBytes}' \
--output table
# Check CloudWatch costs in Cost Explorer
aws ce get-cost-and-usage \
--time-period Start=2026-03-01,End=2026-03-19 \
--granularity MONTHLY \
--metrics UnblendedCost \
--filter '{"Dimensions":{"Key":"SERVICE","Values":["AmazonCloudWatch"]}}'
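The cost arithmetic above can be sanity-checked in the shell (using the $0.50/GB us-east-1 ingestion rate quoted above and a 30-day month):

```shell
# Back-of-envelope CloudWatch Logs ingestion cost
instances=20
gb_per_instance_per_day=10
daily_gb=$((instances * gb_per_instance_per_day))   # 200 GB/day
# awk handles the fractional rate (assumed: $0.50/GB, us-east-1)
daily_cost=$(awk -v gb="$daily_gb" 'BEGIN{printf "%.0f", gb * 0.50}')
monthly_cost=$(awk -v d="$daily_cost" 'BEGIN{printf "%.0f", d * 30}')
echo "Ingestion: \$${daily_cost}/day, \$${monthly_cost}/month"
# prints: Ingestion: $100/day, $3000/month
```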
2. Default EC2 metrics are 5-minute resolution¶
You set a CPU alarm with a 60-second period on a standard EC2 instance. The alarm fires erratically because default EC2 metrics only publish every 5 minutes. Your 1-minute period alarm evaluates against stale or missing data points, causing INSUFFICIENT_DATA flapping.
Fix: Either use 300-second periods with standard monitoring, or enable detailed monitoring for 1-minute metrics (billed as seven metrics per instance at the standard metric rate, roughly $2.10/instance/month in us-east-1):
# Enable detailed monitoring
aws ec2 monitor-instances --instance-ids i-0abc123def456
# Match your alarm period to your monitoring resolution
# Standard monitoring: period >= 300
# Detailed monitoring: period >= 60
3. Alarm evaluation gotchas (period vs evaluation periods)¶
You create an alarm with --period 60 --evaluation-periods 5 --datapoints-to-alarm 3. You think this means "alarm if 3 of the last 5 minutes exceed the threshold." Correct. But you also set --treat-missing-data missing. During low-traffic periods, 3 of 5 data points are missing. The alarm enters INSUFFICIENT_DATA. Your pager goes silent right when the system might be broken.
Fix: Understand the evaluation window:

- Period: aggregation granularity (60s, 300s, etc.)
- Evaluation periods: how many consecutive periods to examine
- Datapoints to alarm: how many of those periods must breach
- Missing data: what happens when a data point is absent
For availability monitoring, use --treat-missing-data breaching. For performance metrics during variable-traffic hours, use --treat-missing-data notBreaching.
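Putting the four settings together in one put-metric-alarm call (a sketch; the alarm name, metric, threshold, and topic ARN are hypothetical):

```shell
# 3 of the last 5 one-minute periods must breach; missing points do not alarm
aws cloudwatch put-metric-alarm \
  --alarm-name api-p99-latency-high \
  --namespace "Custom/PaymentService" \
  --metric-name LatencyP99 \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 500 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-critical
```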
4. Missing data treated as breaching by default... or not¶
You create an alarm and forget to set --treat-missing-data. The default is missing: when enough data points are absent, the alarm transitions to INSUFFICIENT_DATA. It sits there for days without anyone noticing, because INSUFFICIENT_DATA does not trigger alarm actions unless you configure --insufficient-data-actions.
Meanwhile, another team sets --treat-missing-data breaching on a metric that legitimately has gaps during off-peak hours. Their alarm fires at 2 AM every night, creating alert fatigue.
Fix: Set --treat-missing-data explicitly on every alarm. Document why you chose each setting:
# For "this should always have data" metrics (host up/down):
--treat-missing-data breaching
# For "data only appears during activity" metrics (request latency):
--treat-missing-data notBreaching
# For "keep current state if no data" (composite alarm inputs):
--treat-missing-data ignore
5. Custom metric namespace collision¶
Two teams independently publish custom metrics to the namespace Custom/Application. Team A publishes ErrorCount with dimension Service=api. Team B publishes ErrorCount with dimension Service=worker. Both think they own the namespace. Team A creates an alarm on ErrorCount without specifying dimensions and gets Team B's data mixed in.
Fix: Use team-specific or service-specific namespaces:
# Bad: shared namespace
--namespace "Custom/Application"
# Good: scoped namespace
--namespace "Custom/PaymentService"
--namespace "Custom/OrderService"
Each unique combination of namespace + metric name + dimensions is a separate metric. Be deliberate about your dimension schema.
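Publishing into a scoped namespace might look like this (namespace, dimension names, and values are illustrative):

```shell
# Each namespace + metric name + dimension combination is its own metric
aws cloudwatch put-metric-data \
  --namespace "Custom/PaymentService" \
  --metric-name ErrorCount \
  --dimensions Service=api,Environment=production \
  --value 3 \
  --unit Count
```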
6. Log retention set to "never expire"¶
You create log groups and never set retention. Default retention is indefinite. Six months later, you have 50TB of logs stored at $0.03/GB/month — $1,500/month in storage alone. Logs from dev and staging environments from six months ago are still there, costing money and providing zero value.
Fix: Set retention at log group creation time. Audit existing groups:
# Set retention on creation
aws logs create-log-group --log-group-name /myapp/production
aws logs put-retention-policy --log-group-name /myapp/production --retention-in-days 30
# Fix all groups with no retention
for group in $(aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].logGroupName' --output text); do
echo "Setting 30-day retention: $group"
aws logs put-retention-policy --log-group-name "$group" --retention-in-days 30
done
Retention tiers: production (30-90 days), staging (7-14 days), dev (3-7 days). Export to S3 for long-term archival at lower cost.
7. Cross-region metrics not available¶
You have resources in us-east-1 and eu-west-1. You build a dashboard in us-east-1 and add metrics for your EU resources. The widgets are empty. CloudWatch metrics are regional — you can only query metrics in the region where they were published.
Fix: Use cross-account, cross-region dashboards (supported since late 2019) by specifying the metric's region in the widget definition:
{
  "metrics": [
    ["AWS/EC2", "CPUUtilization", "InstanceId", "i-eu-abc123", {"region": "eu-west-1"}]
  ]
}
For alarms, you must create them in the same region as the metric. There is no cross-region alarm.
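Since alarms are regional, one workaround is to create the same alarm in every region that holds the metric (a sketch; names, thresholds, and the assumption of an identically named Auto Scaling group in both regions are illustrative):

```shell
# Duplicate the alarm per region; CloudWatch cannot alarm across regions
for region in us-east-1 eu-west-1; do
  aws cloudwatch put-metric-alarm \
    --region "$region" \
    --alarm-name "cpu-high-$region" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=AutoScalingGroupName,Value=web-asg \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "arn:aws:sns:$region:123456789012:ops-critical"
done
```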
8. Alarm fires but nobody gets notified¶
You create an alarm with --alarm-actions pointing to an SNS topic. The alarm transitions to ALARM. But the SNS topic has no subscribers, or the subscription is pending confirmation, or the endpoint is wrong. The alarm state shows ALARM in the console, but no human was notified.
Fix: Always verify the full notification chain:
# Check alarm actions
aws cloudwatch describe-alarms --alarm-names "my-alarm" \
--query 'MetricAlarms[0].{Actions:AlarmActions,OKActions:OKActions}'
# Check SNS topic subscriptions
aws sns list-subscriptions-by-topic \
--topic-arn arn:aws:sns:us-east-1:123456789012:ops-critical
# Check for pending confirmations
aws sns list-subscriptions-by-topic \
--topic-arn arn:aws:sns:us-east-1:123456789012:ops-critical \
--query 'Subscriptions[?SubscriptionArn==`PendingConfirmation`]'
# Test the notification pipeline
aws sns publish --topic-arn arn:aws:sns:us-east-1:123456789012:ops-critical \
--message "Test notification" --subject "CloudWatch Test Alert"
9. Logs Insights queries timeout on large log groups¶
You run a Logs Insights query across a log group with 500GB of data for the past 30 days. The query times out at 15 minutes. You get partial results or no results. You have no idea if the error you are hunting exists or if the query simply did not finish scanning.
Fix: Narrow the time range. Query duration scales with the amount of data scanned, so smaller windows complete faster:
# Bad: scan 30 days of a huge log group
--start-time $(date -u -d '30 days ago' +%s) --end-time $(date -u +%s)
# Good: narrow to the incident window
--start-time $(date -u -d '2 hours ago' +%s) --end-time $(date -u +%s)
# Also: add specific filters early in the query to reduce data scanned
# Fields and filter clauses prune data before aggregation
If you need historical analysis, export logs to S3 and query with Athena instead.
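The "filter early" advice in CLI form, via start-query (the log group name is hypothetical):

```shell
# Filtering before sort/limit reduces the data flowing through later stages
query_id=$(aws logs start-query \
  --log-group-name /myapp/production \
  --start-time $(date -u -d '2 hours ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 50' \
  --query 'queryId' --output text)
# start-query is asynchronous; poll for results
aws logs get-query-results --query-id "$query_id"
```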
10. CloudWatch Agent config format changes between versions¶
You upgrade the CloudWatch Agent and your config stops working. Field names and defaults have changed between agent versions, including the move from legacy agent formats (collectd and StatsD stanzas) to the unified agent's JSON config. The agent starts but silently stops collecting certain metrics. You do not notice until an alarm goes INSUFFICIENT_DATA days later.
Fix: After any agent upgrade:
# Check agent version
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent -version
# Verify config is valid
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 -s \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
# Check agent logs for errors
tail -100 /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log | grep -i error
# Verify metrics are arriving
aws cloudwatch list-metrics --namespace CWAgent \
--dimensions Name=InstanceId,Value=i-0abc123def456
# If empty, the agent is not publishing. Check IAM role and agent logs.
Pin agent versions in your AMI or config management. Test upgrades in staging before rolling to production.