
AWS CloudWatch - Primer

Why This Matters

CloudWatch is AWS's built-in observability platform — metrics, logs, alarms, and dashboards in one service. Every AWS resource emits CloudWatch metrics by default. When your EC2 instance is at 100% CPU, when your Lambda times out, when your RDS storage fills up — CloudWatch is where you see it first. Understanding CloudWatch deeply is the difference between proactive monitoring and getting paged at 3 AM because nobody set up an alarm.

CloudWatch Metrics

Metrics are the core primitive. A metric is a time-ordered set of data points identified by a namespace, metric name, and dimensions.

Namespaces, Dimensions, and Statistics

# List all metric namespaces in your account
aws cloudwatch list-metrics --query 'Metrics[].Namespace' --output text | tr '\t' '\n' | sort -u

# Common namespaces:
# AWS/EC2        — instance CPU, network, disk
# AWS/RDS        — database connections, IOPS, free storage
# AWS/ELB        — request count, latency, HTTP errors
# AWS/Lambda     — invocations, duration, errors, throttles
# AWS/S3         — bucket size, object count
# Custom/MyApp   — your application metrics

# List metrics in a namespace
aws cloudwatch list-metrics --namespace AWS/EC2

# List metrics for a specific instance
aws cloudwatch list-metrics --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-0abc123def456

# Get CPU utilization for an instance (last hour, 5-min periods)
# (GNU date syntax; on macOS/BSD use: date -u -v-1H +%Y-%m-%dT%H:%M:%S)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average Maximum

Dimensions are key-value pairs that identify a specific metric stream. An EC2 instance metric has InstanceId as a dimension; an ELB metric has LoadBalancerName. CloudWatch does not aggregate across dimension values for you: to combine streams, use GetMetricData with a metric math expression, or have the publisher emit a separate metric without the dimension.

Statistics: Average, Sum, Minimum, Maximum, SampleCount. Extended statistics give percentiles: p50, p90, p99.

Periods: The granularity of aggregation. Default EC2 metrics are 5-minute periods (300 seconds). Detailed monitoring gives 1-minute periods (costs extra).
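To see how these statistics relate, here is a stdlib-only sketch that computes Average, Maximum, and a percentile over one period's worth of samples. The nearest-rank percentile used here is an assumption for illustration; CloudWatch's exact percentile computation may interpolate differently.

```python
# Illustrative only: how Average, Maximum, and p99 relate on the samples of a
# single 5-minute period. Uses nearest-rank percentiles, which may differ
# slightly from CloudWatch's internal computation.
import math

samples = [12.0, 15.5, 14.2, 98.7, 13.1, 16.0, 14.8, 13.9, 15.2, 14.5]

average = sum(samples) / len(samples)
maximum = max(samples)

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

p99 = percentile(samples, 99)
print(f"avg={average:.1f} max={maximum} p99={p99}")
# prints: avg=22.8 max=98.7 p99=98.7
```

Note how a single 98.7 spike barely moves the Average but dominates Maximum and p99, which is why latency alarms are usually built on percentiles rather than averages.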

Custom Metrics

# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace "Custom/MyApp" \
  --metric-name "QueueDepth" \
  --value 42 \
  --unit Count \
  --dimensions Environment=production,Queue=orders

# Publish with timestamp
aws cloudwatch put-metric-data \
  --namespace "Custom/MyApp" \
  --metric-data '[{
    "MetricName": "RequestLatency",
    "Value": 123.45,
    "Unit": "Milliseconds",
    "Timestamp": "2026-03-19T10:00:00Z",
    "Dimensions": [
      {"Name": "Service", "Value": "api"},
      {"Name": "Environment", "Value": "production"}
    ]
  }]'

# Publish high-resolution metric (1-second granularity)
aws cloudwatch put-metric-data \
  --namespace "Custom/MyApp" \
  --metric-data '[{
    "MetricName": "ActiveConnections",
    "Value": 150,
    "StorageResolution": 1,
    "Unit": "Count"
  }]'

Custom metrics cost $0.30/metric/month. High-resolution metrics cost $0.30/metric/month plus higher API costs. Plan your dimensions carefully — each unique combination of dimensions creates a separate metric stream.
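The cardinality math is worth doing before you publish. A back-of-the-envelope sketch (the dimension cardinalities below are hypothetical, and the flat first-tier rate ignores the volume discounts):

```python
# Back-of-the-envelope: every unique combination of dimension values is a
# separate, separately billed metric stream. Cardinalities here are hypothetical.
dimension_values = {
    "Environment": 3,   # production, staging, dev
    "Queue": 20,        # one stream per queue name
    "Host": 50,         # one per instance -- high-cardinality dimensions blow up fast
}
metric_names = 10       # distinct MetricName values published

streams = metric_names
for cardinality in dimension_values.values():
    streams *= cardinality

cost_per_metric = 0.30  # USD/metric/month, first tier (simplified: flat rate)
monthly = streams * cost_per_metric
print(f"{streams} metric streams -> ${monthly:,.2f}/month")
# prints: 30000 metric streams -> $9,000.00/month
```

Dropping the per-host dimension (or publishing it only for a sampled subset) is often the single biggest cost lever.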

CloudWatch Logs

Log Groups and Log Streams

# List log groups
aws logs describe-log-groups --query 'logGroups[].logGroupName' --output table

# Create a log group
aws logs create-log-group --log-group-name /myapp/production

# Set retention (default is forever — costs money)
aws logs put-retention-policy \
  --log-group-name /myapp/production \
  --retention-in-days 30

# Common retention values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653

# List log streams in a group
aws logs describe-log-streams \
  --log-group-name /myapp/production \
  --order-by LastEventTime --descending \
  --limit 10

# Get recent log events
aws logs get-log-events \
  --log-group-name /myapp/production \
  --log-stream-name "i-0abc123/app.log" \
  --start-from-head \
  --limit 50

# Tail logs in real time (AWS CLI v2)
aws logs tail /myapp/production --follow --since 5m
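put-retention-policy rejects anything outside the allowed set above with an InvalidParameterException, so it can be handy to snap a requested retention to the next valid value. A small helper sketch using the values listed above (newer API versions accept a few additional values):

```python
# put-retention-policy only accepts specific retention values; this helper
# snaps a requested number of days to the next allowed value (capped at max).
import bisect

ALLOWED = [1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180,
           365, 400, 545, 731, 1827, 3653]

def nearest_allowed_retention(days):
    """Smallest allowed retention >= requested days (capped at the maximum)."""
    i = bisect.bisect_left(ALLOWED, days)
    return ALLOWED[min(i, len(ALLOWED) - 1)]

print(nearest_allowed_retention(45))    # 60
print(nearest_allowed_retention(9999))  # 3653
```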

CloudWatch Logs Insights

Logs Insights is a query language for searching and analyzing log data:

# Run a Logs Insights query
aws logs start-query \
  --log-group-name /myapp/production \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50'

# Get query results (use query-id from start-query response)
aws logs get-query-results --query-id "12345678-abcd-efgh-ijkl-123456789012"

Common Logs Insights queries:

# Error rate per 5 minutes
filter @message like /ERROR/
| stats count() as error_count by bin(5m)

# Latency percentiles
filter @message like /request completed/
| parse @message '"latency_ms":*,' as latency
| stats avg(latency) as avg_ms, pct(latency, 95) as p95, pct(latency, 99) as p99 by bin(5m)

# Top 10 most frequent error messages
filter @message like /ERROR/
| parse @message 'ERROR: *' as error_msg
| stats count() as cnt by error_msg
| sort cnt desc
| limit 10

# Find requests slower than 5 seconds
filter @message like /duration/
| parse @message 'duration=* ms' as duration
| filter duration > 5000
| sort duration desc
| limit 20

# Lambda cold starts
filter @type = "REPORT"
| parse @message 'Init Duration: * ms' as init_duration
| filter ispresent(init_duration)
| stats count() as cold_starts, avg(init_duration) as avg_init by bin(1h)
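The `parse` keyword in the queries above uses glob patterns where each `*` captures a value. The behavior can be approximated in Python by turning each `*` into a regex capture group; this is an illustration of the matching idea, not the exact Insights engine:

```python
# Approximates Logs Insights' glob-style `parse`: each * becomes a capture group.
import re

def insights_parse(message, pattern):
    """Return captured values for a glob pattern like 'duration=* ms', or None."""
    regex = re.escape(pattern).replace(r"\*", "(.*?)")
    m = re.search(regex, message)
    return m.groups() if m else None

line = "2026-03-19T10:00:01Z INFO request completed duration=6123 ms path=/orders"
captured = insights_parse(line, "duration=* ms")
print(captured)              # ('6123',)
duration = float(captured[0])
assert duration > 5000       # would survive the `filter duration > 5000` stage
```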

Metric Filters

Metric filters extract numeric values from log data and publish them as CloudWatch metrics:

# Create a metric filter that counts ERROR log lines
aws logs put-metric-filter \
  --log-group-name /myapp/production \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ErrorCount,metricNamespace=Custom/MyApp,metricValue=1,defaultValue=0

# Filter pattern matching JSON logs
# { $.statusCode = 500 }
# { $.latency > 5000 }
# { $.level = "ERROR" && $.service = "payment" }

# Create metric filter for JSON log latency
aws logs put-metric-filter \
  --log-group-name /myapp/production \
  --filter-name HighLatency \
  --filter-pattern '{ $.latency > 5000 }' \
  --metric-transformations \
    metricName=HighLatencyCount,metricNamespace=Custom/MyApp,metricValue=1,defaultValue=0
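To build intuition for what those JSON filter patterns select, here is a stdlib-only simulation of the selector semantics against some hypothetical log events (CloudWatch evaluates this server-side on each ingested event; this sketch only mirrors the matching logic):

```python
# Simulates what JSON filter patterns like { $.latency > 5000 } select.
import json

events = [
    '{"level": "INFO",  "latency": 120,  "service": "payment"}',
    '{"level": "ERROR", "latency": 7300, "service": "payment"}',
    '{"level": "ERROR", "latency": 90,   "service": "search"}',
]

def matches(raw, predicate):
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False          # non-JSON lines never match a JSON pattern
    return predicate(doc)

# { $.latency > 5000 }
high_latency = [e for e in events if matches(e, lambda d: d.get("latency", 0) > 5000)]

# { $.level = "ERROR" && $.service = "payment" }
payment_errors = [e for e in events
                  if matches(e, lambda d: d.get("level") == "ERROR"
                                      and d.get("service") == "payment")]

print(len(high_latency), len(payment_errors))   # 1 1
```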

CloudWatch Alarms

Threshold Alarms

# Create a CPU alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "prod-web-01-high-cpu" \
  --alarm-description "CPU above 80% for 10 minutes" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

# Create a disk space alarm (requires CloudWatch Agent)
aws cloudwatch put-metric-alarm \
  --alarm-name "prod-web-01-disk-full" \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=i-0abc123def456 Name=path,Value=/ Name=fstype,Value=ext4 \
  --statistic Average \
  --period 300 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-critical

# Describe an alarm (check its current state)
aws cloudwatch describe-alarms --alarm-names "prod-web-01-high-cpu"

# List all alarms in ALARM state
aws cloudwatch describe-alarms --state-value ALARM

Alarm states: OK, ALARM, INSUFFICIENT_DATA.

Composite Alarms

Composite alarms combine multiple alarms with Boolean logic:

# Create a composite alarm: ALARM when BOTH CPU high AND error rate high
aws cloudwatch put-composite-alarm \
  --alarm-name "prod-service-degraded" \
  --alarm-rule 'ALARM("prod-web-01-high-cpu") AND ALARM("prod-error-rate-high")' \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-critical

Missing Data Treatment

Option        Behavior
breaching     Treat missing data as exceeding the threshold
notBreaching  Treat missing data as within the threshold
ignore        Keep the current alarm state
missing       The alarm transitions to INSUFFICIENT_DATA

Choose deliberately. The default (missing) causes alarms to flip to INSUFFICIENT_DATA during low-traffic periods, which is often wrong.
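The interaction between evaluation periods, datapoints-to-alarm, and missing-data treatment can be sketched as a simple state function. This is a simplified model (real CloudWatch also handles late-arriving data and evaluation ranges), but it shows why the same gap in data flips the outcome:

```python
# Sketch of M-of-N alarm evaluation with missing-data treatment (simplified).
def evaluate(datapoints, threshold, m, n, treat_missing="missing", prev_state="OK"):
    """datapoints: newest-last list of floats, None = missing. Returns new state."""
    window = datapoints[-n:]
    breaching = missing = 0
    for dp in window:
        if dp is None:
            if treat_missing == "breaching":
                breaching += 1
            elif treat_missing == "notBreaching":
                pass                      # counts as a non-breaching datapoint
            elif treat_missing == "ignore":
                return prev_state         # keep whatever state we were in
            else:                         # "missing"
                missing += 1
        elif dp > threshold:
            breaching += 1
    if missing == len(window):
        return "INSUFFICIENT_DATA"
    return "ALARM" if breaching >= m else "OK"

# 2-of-2 CPU alarm at 80%: the same data gap stays OK under notBreaching
# but fires under breaching.
print(evaluate([85, None], 80, 2, 2, "notBreaching"))  # OK
print(evaluate([85, None], 80, 2, 2, "breaching"))     # ALARM
```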

CloudWatch Agent

The agent collects OS-level metrics (memory, disk, process-level) and forwards logs:

# Install on Amazon Linux 2 / RHEL
sudo yum install amazon-cloudwatch-agent

# Install on Ubuntu/Debian
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

# Run the configuration wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Or use a config file directly
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json

Agent Configuration

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent", "mem_available"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent", "disk_free"],
        "resources": ["/", "/data"],
        "metrics_collection_interval": 60
      },
      "procstat": [
        {
          "pattern": "nginx",
          "measurement": ["cpu_usage", "memory_rss", "pid_count"]
        },
        {
          "pattern": "postgres",
          "measurement": ["cpu_usage", "memory_rss"]
        }
      ]
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/app.log",
            "log_group_name": "/myapp/production",
            "log_stream_name": "{instance_id}/app.log",
            "retention_in_days": 30
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "/nginx/production",
            "log_stream_name": "{instance_id}/error.log",
            "retention_in_days": 14
          }
        ]
      }
    }
  }
}

# Check agent status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status

# Restart agent
sudo systemctl restart amazon-cloudwatch-agent

# View agent logs
tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

CloudWatch Events / EventBridge

EventBridge (the successor to CloudWatch Events) routes events between AWS services:

# Create a scheduled rule (cron)
aws events put-rule \
  --name "daily-snapshot" \
  --schedule-expression "cron(0 2 * * ? *)" \
  --state ENABLED \
  --description "Daily EBS snapshot at 2 AM UTC"

# Create an event pattern rule (react to EC2 state changes)
aws events put-rule \
  --name "ec2-state-change" \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]}
  }'

# Add a target (SNS notification)
aws events put-targets \
  --rule "ec2-state-change" \
  --targets "Id"="1","Arn"="arn:aws:sns:us-east-1:123456789012:ops-alerts"

# Add a Lambda target (the function also needs a resource-based
# permission allowing events.amazonaws.com to invoke it;
# see `aws lambda add-permission`)
aws events put-targets \
  --rule "daily-snapshot" \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:create-snapshots"

CloudWatch Synthetics (Canaries)

Canaries are configurable scripts that run on a schedule to monitor endpoints:

# List canaries
aws synthetics describe-canaries

# Get canary run results
aws synthetics get-canary-runs --name my-api-canary --max-results 5

# Canary script example (Node.js, checks HTTPS endpoint)
# Deployed via CloudFormation or CDK typically

Canaries test your endpoints from the outside — they catch issues that internal health checks miss (DNS resolution, certificate expiry, CDN problems).

Container Insights

Container Insights collects metrics and logs from ECS and EKS:

# Enable Container Insights on an ECS cluster
aws ecs update-cluster-settings \
  --cluster production \
  --settings name=containerInsights,value=enabled

# For EKS, deploy the CloudWatch agent as a DaemonSet
# Metrics appear under the ContainerInsights namespace

# Query Container Insights metrics
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name pod_cpu_utilization \
  --dimensions Name=ClusterName,Value=production Name=Namespace,Value=default \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

Cost Considerations

Item                          Cost
Custom metrics                $0.30/metric/month for the first 10,000, tiered discounts after
API calls (GetMetricData)     $0.01 per 1,000 metrics requested
Log ingestion                 $0.50/GB
Log storage                   $0.03/GB/month
Dashboards                    $3.00/dashboard/month (first 3 free)
Alarms (standard resolution)  $0.10/alarm/month
Alarms (high resolution)      $0.30/alarm/month
Canary runs                   $0.0012/run
Contributor Insights          $0.50/rule/month plus $0.02 per million matching log events

(Prices are us-east-1 list prices; they vary by region and change over time.)

The two big cost drivers are log ingestion and custom metrics. A verbose application logging 100GB/day costs $50/day in ingestion alone. Each unique dimension combination counts as a separate metric.
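A quick monthly sanity check using the rates from the table above (the half-full-window storage estimate is a simplification, and the workload numbers are hypothetical):

```python
# Monthly-cost sanity check using the table's rates. The rates are list
# prices at time of writing; the workload figures below are hypothetical.
GB_INGEST = 0.50        # $/GB ingested
GB_STORE  = 0.03        # $/GB-month stored
METRIC    = 0.30        # $/custom-metric/month (first tier)

daily_log_gb   = 100
retention_days = 30
custom_metrics = 2_000

ingestion = daily_log_gb * 30 * GB_INGEST                # one month of ingestion
# With 30-day retention the stored window averages about half full (simplified):
storage   = daily_log_gb * retention_days / 2 * GB_STORE
metrics   = custom_metrics * METRIC

print(f"ingestion=${ingestion:,.0f} storage=${storage:,.0f} metrics=${metrics:,.0f}")
# prints: ingestion=$1,500 storage=$45 metrics=$600
```

Ingestion dwarfs storage by more than an order of magnitude here, which is why trimming verbose logs pays off far faster than shortening retention.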

Key Takeaway

CloudWatch is three things: metrics (numbers over time, from AWS services and your code), logs (searchable text streams with Insights query language), and alarms (thresholds that trigger actions). Master Logs Insights queries for debugging, set up alarms with deliberate missing-data treatment, install the CloudWatch Agent for memory and disk metrics that EC2 does not provide by default, and watch your log ingestion costs before they surprise you on the bill.

