AWS CloudWatch - Street-Level Ops¶
Real-world workflows for monitoring, debugging alarms, querying logs, and building operational dashboards.
Querying Logs with Insights¶
Common Investigation Queries¶
# Run a Logs Insights query from the CLI
aws logs start-query \
--log-group-names /myapp/production \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100'
# Poll for results (queries are async)
QUERY_ID="<id-from-above>"
aws logs get-query-results --query-id "$QUERY_ID"
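get-query-results returns immediately with a status field (Scheduled, Running, Complete, Failed, Cancelled, Timeout), so in practice you wrap it in a small wait loop. A minimal sketch, assuming jq is installed:
# Poll until the query finishes (sketch; assumes jq)
while true; do
  STATUS=$(aws logs get-query-results --query-id "$QUERY_ID" --output json | jq -r '.status')
  case "$STATUS" in
    Complete) break ;;
    Failed|Cancelled|Timeout) echo "Query ended with status: $STATUS" >&2; break ;;
    *) sleep 2 ;;
  esac
done
aws logs get-query-results --query-id "$QUERY_ID" --output json | jq '.results'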
Queries you will run constantly:
# Error rate per 5-minute bucket
filter @message like /ERROR/
| stats count() as errors by bin(5m) as window
| sort window desc
# Latency percentiles from structured JSON logs
filter ispresent(latency_ms)
| stats avg(latency_ms) as avg,
pct(latency_ms, 50) as p50,
pct(latency_ms, 95) as p95,
pct(latency_ms, 99) as p99
by bin(5m)
# Top talkers — which source IPs generate the most requests
parse @message '"source_ip":"*"' as src_ip
| stats count() as requests by src_ip
| sort requests desc
| limit 20
# Find the slowest requests with full context
filter latency_ms > 5000
| fields @timestamp, method, path, latency_ms, status_code, trace_id
| sort latency_ms desc
| limit 50
# Count unique users in the last hour
filter ispresent(user_id)
| stats count_distinct(user_id) as unique_users by bin(1h)
# Lambda error investigation
filter @type = "REPORT"
| parse @message 'Duration: * ms' as duration
| parse @message 'Max Memory Used: * MB' as memory
| parse @message 'Init Duration: * ms' as cold_start
| stats avg(duration) as avg_duration,
max(duration) as max_duration,
count() as invocations,
sum(ispresent(cold_start)) as cold_starts
by bin(5m)
Querying Across Multiple Log Groups¶
# Search across all application log groups
aws logs start-query \
--log-group-names /myapp/production /myapp/worker /myapp/scheduler \
--start-time $(date -u -d '30 minutes ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'filter @message like /correlation_id=abc123/
| fields @timestamp, @logStream, @message
| sort @timestamp asc'
Setting Up Critical Alarms¶
CPU, Memory, and Disk (with CloudWatch Agent)¶
# CPU alarm (built-in metric)
aws cloudwatch put-metric-alarm \
--alarm-name "prod-web-cpu-high" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-pager
# Memory alarm (requires CloudWatch Agent)
aws cloudwatch put-metric-alarm \
--alarm-name "prod-web-memory-high" \
--namespace CWAgent \
--metric-name mem_used_percent \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--statistic Average \
--period 300 \
--threshold 90 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--treat-missing-data breaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-pager
# Disk alarm (requires CloudWatch Agent)
aws cloudwatch put-metric-alarm \
--alarm-name "prod-web-disk-critical" \
--namespace CWAgent \
--metric-name disk_used_percent \
--dimensions Name=InstanceId,Value=i-0abc123def456 Name=path,Value=/ Name=fstype,Value=ext4 \
--statistic Maximum \
--period 300 \
--threshold 90 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--treat-missing-data breaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-critical
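When one sick host trips the CPU, memory, and disk alarms at once, a composite alarm can collapse them into a single page. A sketch using the alarm names above; to actually avoid double paging you would also remove --alarm-actions from the child alarms:
# Composite alarm: one page when either host alarm fires (sketch)
aws cloudwatch put-composite-alarm \
--alarm-name "prod-web-host-degraded" \
--alarm-rule 'ALARM("prod-web-cpu-high") OR ALARM("prod-web-memory-high")' \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-pager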
Application-Level Alarms¶
# Error rate alarm (from metric filter on logs)
aws cloudwatch put-metric-alarm \
--alarm-name "prod-error-rate-high" \
--namespace Custom/MyApp \
--metric-name ErrorCount \
--statistic Sum \
--period 300 \
--threshold 50 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-pager
# ALB 5xx rate
aws cloudwatch put-metric-alarm \
--alarm-name "prod-alb-5xx-high" \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/prod-alb/abc123 \
--statistic Sum \
--period 60 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 3 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-pager
Debugging Alarm State¶
When an alarm is misbehaving:
# Check alarm details and current state
aws cloudwatch describe-alarms \
--alarm-names "prod-web-cpu-high" \
--output json | jq '.MetricAlarms[0] | {
  AlarmName,
  StateValue,
  StateReason,
  StateUpdatedTimestamp,
  EvaluationPeriods,
  DatapointsToAlarm,
  TreatMissingData,
  Threshold
}'
# Check alarm history (state transitions)
aws cloudwatch describe-alarm-history \
--alarm-name "prod-web-cpu-high" \
--history-item-type StateUpdate \
--max-records 20
# Get the actual metric data the alarm is evaluating
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average \
--output table
# List all alarms currently firing
aws cloudwatch describe-alarms --state-value ALARM --output table \
--query 'MetricAlarms[].{Name:AlarmName,State:StateValue,Reason:StateReason}'
# List alarms in INSUFFICIENT_DATA (often a config problem)
aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA --output table \
--query 'MetricAlarms[].{Name:AlarmName,Namespace:Namespace,Metric:MetricName}'
Common reasons for INSUFFICIENT_DATA:
- Missing data treatment set to missing (default)
- Wrong dimensions (typo in InstanceId, wrong path for disk metric)
- CloudWatch Agent not running or not sending to correct namespace
- Metric has not been published yet (new instance, new custom metric)
Default trap:
treat-missing-data defaults to missing, which puts the alarm into INSUFFICIENT_DATA when no data arrives. For metrics published by the CloudWatch Agent (memory, disk), set it to breaching: if the agent stops reporting, that is itself a problem worth alerting on. For application metrics that are legitimately zero during off-hours, use notBreaching.
Log-Based Metric Filters¶
Convert log patterns into metrics you can alarm on:
# Count 5xx responses from access logs
aws logs put-metric-filter \
--log-group-name /nginx/production \
--filter-name "5xxErrors" \
--filter-pattern '[ip, id, user, timestamp, request, status_code=5*, size]' \
--metric-transformations \
metricName=Nginx5xxCount,metricNamespace=Custom/Nginx,metricValue=1,defaultValue=0
# Extract latency from JSON logs
aws logs put-metric-filter \
--log-group-name /myapp/production \
--filter-name "RequestLatency" \
--filter-pattern '{ $.latency_ms = * }' \
--metric-transformations \
metricName=RequestLatency,metricNamespace=Custom/MyApp,metricValue='$.latency_ms',defaultValue=0
# Count OOM kills from syslog
aws logs put-metric-filter \
--log-group-name /var/log/syslog \
--filter-name "OOMKills" \
--filter-pattern "Out of memory: Kill process" \
--metric-transformations \
metricName=OOMKillCount,metricNamespace=Custom/System,metricValue=1,defaultValue=0
# List existing metric filters
aws logs describe-metric-filters --log-group-name /myapp/production
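Filter patterns are easy to get subtly wrong (a phrase containing spaces needs its own double quotes, for example), so it pays to dry-run them before deploying. test-metric-filter checks a pattern against sample messages without touching any log group; the events below are made up:
# Dry-run a filter pattern against sample events
aws logs test-metric-filter \
--filter-pattern '"Out of memory: Kill process"' \
--log-event-messages \
"Mar 18 12:00:01 host kernel: Out of memory: Kill process 1234 (java)" \
"Mar 18 12:00:02 host sshd[812]: Accepted publickey for deploy"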
Cross-Account Log Aggregation¶
Centralize logs from multiple AWS accounts into one:
# In the destination (central) account — create a destination
aws logs put-destination \
--destination-name central-logs \
--target-arn arn:aws:kinesis:us-east-1:CENTRAL_ACCOUNT:stream/log-aggregation \
--role-arn arn:aws:iam::CENTRAL_ACCOUNT:role/CWLtoKinesisRole
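The destination also needs an access policy before source accounts may subscribe to it; without one, the put-subscription-filter below fails with an authorization error. A sketch, with SOURCE_ACCOUNT standing in for each sender's account ID:
# Still in the central account: grant source accounts access to the destination
aws logs put-destination-policy \
--destination-name central-logs \
--access-policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "SOURCE_ACCOUNT"},
    "Action": "logs:PutSubscriptionFilter",
    "Resource": "arn:aws:logs:us-east-1:CENTRAL_ACCOUNT:destination:central-logs"
  }]
}'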
# In each source account — create a subscription filter
aws logs put-subscription-filter \
--log-group-name /myapp/production \
--filter-name send-to-central \
--filter-pattern "" \
--destination-arn arn:aws:logs:us-east-1:CENTRAL_ACCOUNT:destination:central-logs
Cost-Effective Log Retention Strategy¶
# Audit current retention settings (find "never expire" groups)
aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].{Name:logGroupName,StoredBytes:storedBytes}' \
--output table
# Set retention on all log groups that have none
for group in $(aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].logGroupName' --output text); do
echo "Setting 30-day retention on: $group"
aws logs put-retention-policy \
--log-group-name "$group" \
--retention-in-days 30
done
# Check storage costs by log group (largest first)
aws logs describe-log-groups \
--query 'reverse(sort_by(logGroups, &storedBytes))[:10].{Name:logGroupName,Bytes:storedBytes}' \
--output table
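JMESPath cannot do arithmetic, so the report above stays in raw bytes. If jq is available, a small pipeline converts the top ten to GB:
# Same report with sizes converted to GB (sketch; assumes jq)
aws logs describe-log-groups --output json | jq -r '
  .logGroups | sort_by(-.storedBytes) | .[:10][]
  | [((.storedBytes / 1e9 * 100 | round) / 100), .logGroupName] | @tsv'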
CloudWatch Agent Troubleshooting¶
# Check if agent is running
sudo systemctl status amazon-cloudwatch-agent
# View agent logs
tail -50 /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
# Common errors in agent log:
# "AccessDenied" — IAM role missing cloudwatch:PutMetricData or logs:PutLogEvents
# "ThrottlingException" — too many API calls, reduce collection interval
# "no such file or directory" — log file path in config does not exist
# Verify agent config
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status
# Test IAM permissions (from the instance)
aws cloudwatch put-metric-data \
--namespace CWAgent \
--metric-name test_metric \
--value 1 \
--unit Count 2>&1
# Restart agent after config change
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 -s \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
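The mem_used_percent and disk_used_percent metrics alarmed on earlier only exist if the agent config collects them. A minimal config sketch, written via heredoc and then loaded with the fetch-config command above:
# Minimal agent config collecting the memory and disk metrics used above
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json > /dev/null <<'EOF'
{
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
    "metrics_collected": {
      "mem": {"measurement": ["mem_used_percent"]},
      "disk": {"measurement": ["disk_used_percent"], "resources": ["/"]}
    }
  }
}
EOF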
Required IAM permissions for the agent:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "ec2:DescribeTags",
        "ssm:GetParameter"
      ],
      "Resource": "*"
    }
  ]
}
Building Operational Dashboards¶
# Create a dashboard via CLI (JSON widget definitions)
aws cloudwatch put-dashboard \
--dashboard-name "production-overview" \
--dashboard-body '{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "CPU Utilization",
        "metrics": [
          ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0abc123", {"label": "web-01"}],
          ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0def456", {"label": "web-02"}]
        ],
        "period": 300,
        "stat": "Average",
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "x": 12, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "ALB Request Count & 5xx",
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/prod-alb/abc123", {"stat": "Sum"}],
          ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", "app/prod-alb/abc123", {"stat": "Sum", "color": "#d62728"}]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "log",
      "x": 0, "y": 6, "width": 24, "height": 6,
      "properties": {
        "title": "Error Rate",
        "query": "SOURCE '\''/myapp/production'\'' | filter @message like /ERROR/ | stats count() by bin(5m)",
        "region": "us-east-1",
        "view": "timeSeries"
      }
    }
  ]
}'
# List dashboards
aws cloudwatch list-dashboards
# Get dashboard JSON (for version control)
aws cloudwatch get-dashboard --dashboard-name "production-overview"
Alarm-to-SNS-to-PagerDuty Pipeline¶
# 1. Create SNS topic
aws sns create-topic --name ops-critical
# Returns: arn:aws:sns:us-east-1:123456789012:ops-critical
# 2. Subscribe PagerDuty integration endpoint
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:ops-critical \
--protocol https \
--notification-endpoint "https://events.pagerduty.com/integration/YOUR_INTEGRATION_KEY/enqueue"
# 3. Also subscribe email for backup
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:ops-critical \
--protocol email \
--notification-endpoint "oncall@example.com"
# 4. Point alarms at the topic (--alarm-actions in put-metric-alarm)
# 5. Test the pipeline
aws sns publish \
--topic-arn arn:aws:sns:us-east-1:123456789012:ops-critical \
--message "Test alert — ignore" \
--subject "CloudWatch Test"
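sns publish proves delivery from SNS outward but bypasses CloudWatch itself. Forcing an alarm into ALARM exercises the whole path; the alarm reverts to its real state on the next evaluation:
# 6. Force an alarm through the full pipeline (state reverts on next evaluation)
aws cloudwatch set-alarm-state \
--alarm-name "prod-web-cpu-high" \
--state-value ALARM \
--state-reason "End-to-end pipeline test"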
Embedded Metric Format for Lambda¶
Publish custom metrics from Lambda without API calls — write structured JSON to stdout and CloudWatch extracts metrics automatically:
# In Lambda function code
import json
import time

def handler(event, context):
    # Process request...
    latency_ms = 42.5

    # EMF log line — CloudWatch extracts metrics automatically
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # current time in ms, not a hardcoded value
            "CloudWatchMetrics": [{
                "Namespace": "Custom/MyLambda",
                "Dimensions": [["FunctionName", "Environment"]],
                "Metrics": [
                    {"Name": "Latency", "Unit": "Milliseconds"},
                    {"Name": "RequestCount", "Unit": "Count"}
                ]
            }]
        },
        "FunctionName": "process-orders",
        "Environment": "production",
        "Latency": latency_ms,
        "RequestCount": 1,
        "message": "Request processed successfully"
    }))
EMF avoids the $0.01 per 1,000 PutMetricData API requests and keeps metric publishing out of the request path. The extracted metrics appear in CloudWatch like any other custom metric and are billed at standard custom-metric rates; beyond that you pay only normal log ingestion, with no API call costs.
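After a few invocations, you can confirm extraction worked by checking that the namespace and metric from the example above exist:
# Confirm the EMF-extracted metric exists
aws cloudwatch list-metrics \
--namespace Custom/MyLambda \
--metric-name Latency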