Elasticsearch - Street-Level Ops

Real-world workflows for operating Elasticsearch clusters in production.

Quick Health Check

# One-liner cluster status
curl -s localhost:9200/_cluster/health | jq '{status, number_of_nodes, unassigned_shards}'

# Output:
# {
#   "status": "yellow",
#   "number_of_nodes": 3,
#   "unassigned_shards": 4
# }

# Node resource overview
# The URL must be quoted: an unquoted '&' backgrounds curl and the shell
# parses 'h=...' as a variable assignment; '?v' is also a shell glob pattern.
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,disk.used_percent,cpu'

# Output:
# name     node.role  heap.percent  disk.used_percent  cpu
# data-1   d          72            68                 35
# data-2   d          65            71                 28
# master-1 m          45            32                 12

Finding Why Shards Are Unassigned

# List unassigned shards with reasons
# URL quoted: an unquoted '&' would background curl and drop the 'h=' columns,
# and '?' is a shell glob character.
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

# Output:
# app-logs-2024.03.15 2 r UNASSIGNED NODE_LEFT
# app-logs-2024.03.15 4 r UNASSIGNED ALLOCATION_FAILED

# Detailed explanation for the first unassigned shard
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'

# Force retry allocation after fixing the underlying issue
# (clears the max-retries counter on shards stuck in ALLOCATION_FAILED)
curl -X POST 'localhost:9200/_cluster/reroute?retry_failed=true'

Index Lifecycle Operations

# Check which indices are consuming the most disk
# URL quoted: unquoted '&' backgrounds curl mid-command; '?' is a shell glob.
curl -s 'localhost:9200/_cat/indices?v&s=store.size:desc&h=index,health,pri,rep,docs.count,store.size' | head -15

# Output:
# index                    health pri rep docs.count store.size
# app-logs-2024.03.14      green  3   1   22345678   18.5gb
# app-logs-2024.03.13      green  3   1   21100000   17.2gb
# metrics-2024.03.14       green  2   1   8500000    4.1gb

# Delete old indices to free disk
# Quote the URL so the shell cannot glob-expand '*' against local files —
# the wildcard must reach Elasticsearch verbatim.
curl -X DELETE 'localhost:9200/app-logs-2024.02.*'

# Reduce replicas on old indices to save space
curl -X PUT localhost:9200/app-logs-2024.03.01/_settings -H 'Content-Type: application/json' \
  -d '{"index":{"number_of_replicas":0}}'

# Force merge read-only indices (reduce segments, save heap)
curl -X POST 'localhost:9200/app-logs-2024.03.01/_forcemerge?max_num_segments=1'

Diagnosing Slow Queries

# Enable slow query logging (threshold: 5 seconds)
# Quote the URL so the shell cannot glob-expand 'app-logs-*' against local files.
curl -X PUT 'localhost:9200/app-logs-*/_settings' -H 'Content-Type: application/json' -d '{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s"
}'

# Profile a specific query; the jq filter pulls the timing of the first
# query node on the first shard
curl -X POST localhost:9200/app-logs-2024.03.14/_search -H 'Content-Type: application/json' -d '{
  "profile": true,
  "query": { "match": { "message": "timeout" } }
}' | jq '.profile.shards[0].searches[0].query[0].time_in_nanos'

# Check thread pool queues and rejections
# URL quoted: an unquoted '&' would background curl and discard the columns.
curl -s 'localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected' | grep -E "search|write"

# Output:
# node_name  name    active  queue  rejected
# data-1     search  5       0      0
# data-1     write   12      45     3

Snapshot and Restore

# List available snapshots in the 's3_backup' repository: one summary object
# per snapshot (name, completion state, number of indices it contains)
curl -s localhost:9200/_snapshot/s3_backup/_all | jq '.snapshots[] | {snapshot: .snapshot, state: .state, indices: (.indices | length)}'

# Take a snapshot of specific indices; the name embeds a timestamp so repeated
# runs never collide. The call returns immediately — the snapshot continues
# in the background (monitor it with the _current call below).
curl -X PUT "localhost:9200/_snapshot/s3_backup/snap-$(date +%Y%m%d-%H%M)" \
  -H 'Content-Type: application/json' -d '{
  "indices": "app-logs-2024.03.*",
  "ignore_unavailable": true
}'

# Monitor snapshot progress: per-shard done/total counts for the
# currently-running snapshot
curl -s localhost:9200/_snapshot/s3_backup/_current | jq '.snapshots[0].shards'

# Restore a single index from snapshot under a new name ("restored-" prefix
# via the rename pattern) so it cannot clash with the live index
curl -X POST localhost:9200/_snapshot/s3_backup/snap-20240315-0200/_restore \
  -H 'Content-Type: application/json' -d '{
  "indices": "app-logs-2024.03.10",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}'

Emergency: Cluster RED

# 1. Get the full picture
#    URLs quoted: '?v' and '?pretty' are shell glob patterns and would be
#    rewritten if a matching filename exists in the current directory.
curl -s 'localhost:9200/_cluster/health?pretty'
curl -s 'localhost:9200/_cat/nodes?v'
curl -s 'localhost:9200/_cat/shards?v' | grep -c UNASSIGNED

# 2. Check disk watermarks (often the cause)
#    URL quoted: an unquoted '&' would background curl and drop the columns.
curl -s 'localhost:9200/_cat/allocation?v&h=node,disk.used,disk.avail,disk.percent'

# 3. If disk full, temporarily raise watermark
#    (transient settings do not survive a full cluster restart)
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "92%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}'

# 4. Delete indices to free space, then reset watermarks
#    Quote the DELETE URL so the shell cannot glob-expand '*' locally;
#    setting a key to null restores its default, and the '*' wildcard
#    resets all three watermark settings at once.
curl -X DELETE 'localhost:9200/app-logs-2024.01.*'
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.disk.watermark.*":null}}'

Node Drain Before Maintenance

# Exclude a node from allocation (drain its shards to other nodes).
# Uses a transient setting, so it clears itself on full cluster restart.
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._name": "data-3"
  }
}'

# Watch shards move off the node — safe to take it down once the count hits 0
watch 'curl -s localhost:9200/_cat/shards | grep data-3 | wc -l'

# After maintenance, re-include the node by clearing the exclusion
# (an empty string removes the filter, so shards rebalance back)
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._name": ""
  }
}'

JVM Heap Pressure

# Check heap usage across nodes
# URL quoted: an unquoted '&' backgrounds curl and the shell treats
# 'h=...' as a variable assignment, silently discarding the columns.
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'

# Check GC stats for long pauses (rising old-gen counts/times under
# sustained high heap_pct indicate heap pressure)
curl -s localhost:9200/_nodes/stats/jvm | jq '.nodes | to_entries[] | {
  name: .value.name,
  heap_pct: .value.jvm.mem.heap_used_percent,
  old_gc_count: .value.jvm.gc.collectors.old.collection_count,
  old_gc_time_ms: .value.jvm.gc.collectors.old.collection_time_in_millis
}'

# Check fielddata usage (common heap hog) — URL quoted for the same reason
curl -s 'localhost:9200/_cat/fielddata?v&h=node,field,size'