Elasticsearch - Street-Level Ops
Real-world workflows for operating Elasticsearch clusters in production.
Quick Health Check
# One-liner cluster status
curl -s localhost:9200/_cluster/health | jq '{status, number_of_nodes, unassigned_shards}'
# Output:
# {
# "status": "yellow",
# "number_of_nodes": 3,
# "unassigned_shards": 4
# }
# Node resource overview (quote the URL: an unquoted & would background the command)
curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,disk.used_percent,cpu"
# Output:
# name node.role heap.percent disk.used_percent cpu
# data-1 d 72 68 35
# data-2 d 65 71 28
# master-1 m 45 32 12
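The health one-liner above is easy to wrap into a monitoring-friendly exit code. A minimal sketch in portable shell, using sed instead of jq so it also works on stripped-down hosts; the canned response stands in for the live curl call:

```shell
#!/bin/sh
# Map cluster status to a monitoring exit code: 0=green, 1=yellow, 2=red.
health_exit_code() {
  status=$(printf '%s' "$1" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p')
  case "$status" in
    green)  return 0 ;;
    yellow) return 1 ;;
    *)      return 2 ;;   # red, or anything unparseable
  esac
}

# Canned sample; live version: resp=$(curl -s localhost:9200/_cluster/health)
resp='{"status":"yellow","number_of_nodes":3,"unassigned_shards":4}'
rc=0; health_exit_code "$resp" || rc=$?
echo "cluster check exit code: $rc"
# → cluster check exit code: 1
```

Cron or an alerting agent can run this directly and page on a non-zero exit code.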
Finding Why Shards Are Unassigned
# List unassigned shards with reasons
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
# Output:
# app-logs-2024.03.15 2 r UNASSIGNED NODE_LEFT
# app-logs-2024.03.15 4 r UNASSIGNED ALLOCATION_FAILED
# Detailed explanation for the first unassigned shard
curl -s "localhost:9200/_cluster/allocation/explain?pretty"
# Force retry allocation after fixing the underlying issue
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"
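When many shards are unassigned, it helps to tally them by reason before diving into allocation/explain on individual shards. A sketch using awk over the `_cat/shards` columns from above; the here-doc is a canned sample standing in for the live call:

```shell
#!/bin/sh
# Tally unassigned shards by reason to see which failure mode dominates.
# Sample stands in for:
#   curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason"
awk '$4 == "UNASSIGNED" {count[$5]++} END {for (r in count) print count[r], r}' <<'EOF' | sort -rn
app-logs-2024.03.15 2 r UNASSIGNED NODE_LEFT
app-logs-2024.03.15 4 r UNASSIGNED ALLOCATION_FAILED
app-logs-2024.03.16 1 r UNASSIGNED NODE_LEFT
metrics-2024.03.15 0 p STARTED
EOF
# → 2 NODE_LEFT
#   1 ALLOCATION_FAILED
```

NODE_LEFT usually resolves itself when the node rejoins; ALLOCATION_FAILED is the one that needs the reroute retry shown above.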
Index Lifecycle Operations
# Check which indices are consuming the most disk
curl -s "localhost:9200/_cat/indices?v&s=store.size:desc&h=index,health,pri,rep,docs.count,store.size" | head -15
# Output:
# index health pri rep docs.count store.size
# app-logs-2024.03.14 green 3 1 22345678 18.5gb
# app-logs-2024.03.13 green 3 1 21100000 17.2gb
# metrics-2024.03.14 green 2 1 8500000 4.1gb
# Delete old indices to free disk (on 8.x, wildcard deletes also require
# action.destructive_requires_name to be false)
curl -X DELETE "localhost:9200/app-logs-2024.02.*"
# Reduce replicas on old indices to save space
curl -X PUT localhost:9200/app-logs-2024.03.01/_settings -H 'Content-Type: application/json' \
-d '{"index":{"number_of_replicas":0}}'
# Force merge read-only indices (reduce segments, save heap)
curl -X POST "localhost:9200/app-logs-2024.03.01/_forcemerge?max_num_segments=1"
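Retention is easier to script with date math than with hard-coded index names. A sketch assuming GNU date (on macOS/BSD the equivalent is `date -v-7d +%Y.%m.%d`); the delete is left commented so the script is a dry run:

```shell
#!/bin/sh
# Compute the index name that falls RETENTION_DAYS in the past,
# matching the app-logs-YYYY.MM.dd naming used above.
RETENTION_DAYS=7
cutoff=$(date -d "$RETENTION_DAYS days ago" +%Y.%m.%d)
echo "would delete: app-logs-$cutoff"
# Live version:
#   curl -X DELETE "localhost:9200/app-logs-$cutoff"
```

On clusters with ILM available, an index lifecycle policy is the more robust way to do the same thing.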
Diagnosing Slow Queries
# Enable slow query logging (threshold: 5 seconds)
curl -X PUT localhost:9200/app-logs-*/_settings -H 'Content-Type: application/json' -d '{
"index.search.slowlog.threshold.query.warn": "5s",
"index.search.slowlog.threshold.query.info": "2s"
}'
# Profile a specific query
curl -X POST localhost:9200/app-logs-2024.03.14/_search -H 'Content-Type: application/json' -d '{
"profile": true,
"query": { "match": { "message": "timeout" } }
}' | jq '.profile.shards[0].searches[0].query[0].time_in_nanos'
# Check thread pool queues and rejections
curl -s "localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected" | grep -E "search|write"
# Output:
# node_name name active queue rejected
# data-1 search 5 0 0
# data-1 write 12 45 3
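Rejections are the signal that matters here: a busy queue is just load, but rejections are dropped work. A sketch that flags only node/pool pairs with a backlog or rejections, again with a canned sample of the `_cat` output:

```shell
#!/bin/sh
# Print only node/pool pairs with queued or rejected tasks.
# Sample stands in for:
#   curl -s "localhost:9200/_cat/thread_pool?h=node_name,name,active,queue,rejected"
awk '$4 > 0 || $5 > 0 {printf "%s/%s: queue=%s rejected=%s\n", $1, $2, $4, $5}' <<'EOF'
data-1 search 5 0 0
data-1 write 12 45 3
data-2 write 8 0 0
EOF
# → data-1/write: queue=45 rejected=3
```

Persistent write rejections on one node usually mean hot shards or an undersized node rather than cluster-wide overload.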
Snapshot and Restore
# List available snapshots
curl -s localhost:9200/_snapshot/s3_backup/_all | jq '.snapshots[] | {snapshot: .snapshot, state: .state, indices: (.indices | length)}'
# Take a snapshot of specific indices
curl -X PUT "localhost:9200/_snapshot/s3_backup/snap-$(date +%Y%m%d-%H%M)" \
-H 'Content-Type: application/json' -d '{
"indices": "app-logs-2024.03.*",
"ignore_unavailable": true
}'
# Monitor snapshot progress
curl -s localhost:9200/_snapshot/s3_backup/_current | jq '.snapshots[0].shards'
# Restore a single index from snapshot
curl -X POST localhost:9200/_snapshot/s3_backup/snap-20240315-0200/_restore \
-H 'Content-Type: application/json' -d '{
"indices": "app-logs-2024.03.10",
"rename_pattern": "(.+)",
"rename_replacement": "restored-$1"
}'
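A snapshot shouldn't be trusted for restore until it reaches a terminal state. A sketch of a polling loop; `snapshot_state` is stubbed so the logic is self-contained, with the live call it would wrap shown in the comment (assuming the `s3_backup` repo and jq, as above):

```shell
#!/bin/sh
# Poll a snapshot until it reaches a terminal state.
snapshot_state() {
  # Live version:
  #   curl -s "localhost:9200/_snapshot/s3_backup/$1" | jq -r '.snapshots[0].state'
  echo "SUCCESS"   # stub so the loop can run without a cluster
}

wait_for_snapshot() {
  while :; do
    state=$(snapshot_state "$1")
    case "$state" in
      SUCCESS)        echo "snapshot $1 complete"; return 0 ;;
      FAILED|PARTIAL) echo "snapshot $1 ended in state $state" >&2; return 1 ;;
      *)              sleep 10 ;;   # IN_PROGRESS etc.
    esac
  done
}

wait_for_snapshot "snap-20240315-0200"
# → snapshot snap-20240315-0200 complete
```

Treating PARTIAL as a failure is deliberate: a partial snapshot is missing shards and is not a safe restore source.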
Emergency: Cluster RED
# 1. Get the full picture
curl -s "localhost:9200/_cluster/health?pretty"
curl -s "localhost:9200/_cat/nodes?v"
curl -s "localhost:9200/_cat/shards?v" | grep -c UNASSIGNED
# 2. Check disk watermarks (often the cause)
curl -s "localhost:9200/_cat/allocation?v&h=node,disk.used,disk.avail,disk.percent"
# 3. If disk full, temporarily raise watermark
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "92%",
"cluster.routing.allocation.disk.watermark.high": "95%",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
}
}'
# 4. Delete indices to free space, then reset watermarks
curl -X DELETE "localhost:9200/app-logs-2024.01.*"
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' \
-d '{"transient":{"cluster.routing.allocation.disk.watermark.*":null}}'
Node Drain Before Maintenance
# Exclude a node from allocation (drain its shards to other nodes)
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.exclude._name": "data-3"
}
}'
# Watch shards move off the node
watch 'curl -s localhost:9200/_cat/shards | grep data-3 | wc -l'
# After maintenance, re-include the node
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.exclude._name": ""
}
}'
JVM Heap Pressure
# Check heap usage across nodes
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent"
# Check GC stats for long pauses
curl -s localhost:9200/_nodes/stats/jvm | jq '.nodes | to_entries[] | {
name: .value.name,
heap_pct: .value.jvm.mem.heap_used_percent,
old_gc_count: .value.jvm.gc.collectors.old.collection_count,
old_gc_time_ms: .value.jvm.gc.collectors.old.collection_time_in_millis
}'
# Check fielddata usage (common heap hog)
curl -s "localhost:9200/_cat/fielddata?v&h=node,field,size"
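The GC counters from the stats query are cumulative, so the number worth watching is the average pause per old-gen collection. A sketch of the arithmetic with sample values plugged in from the jq output above:

```shell
#!/bin/sh
# Average old-gen GC pause = cumulative collection time / collection count.
# Sustained averages in the hundreds of ms usually mean real heap pressure.
old_gc_count=120      # sample value of .jvm.gc.collectors.old.collection_count
old_gc_time_ms=5400   # sample value of .jvm.gc.collectors.old.collection_time_in_millis
awk -v c="$old_gc_count" -v t="$old_gc_time_ms" \
  'BEGIN { printf "avg old GC pause: %.1f ms\n", t / c }'
# → avg old GC pause: 45.0 ms
```

To catch a trend rather than a lifetime average, sample the two counters twice and divide the deltas instead.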