
Elasticsearch - Primer

Why This Matters

Elasticsearch powers the search and analytics behind most centralized logging stacks (ELK/EFK), application search, and observability platforms. When it is healthy, teams query terabytes of logs in seconds. When it is not — red cluster health, unassigned shards, JVM heap pressure — your logging pipeline backs up, dashboards go blank, and you lose visibility into production right when you need it most.

Fun fact: Elasticsearch was created by Shay Banon in 2010 as a scalable wrapper around Apache Lucene (the full-text search library). The name comes from the idea of making Lucene "elastic" — horizontally scalable across many nodes. Elastic NV (now Elastic) was founded in 2012. The ELK stack (Elasticsearch + Logstash + Kibana) became the de facto open-source logging stack, processing petabytes of logs daily at companies like Netflix, Uber, and Wikipedia.

Remember: Elasticsearch cluster health colors: Green = all primary and replica shards allocated. Yellow = all primaries allocated but some replicas are not (single-node clusters are always yellow because replicas cannot be on the same node as primaries). Red = some primary shards are unallocated — data is missing and searches return incomplete results. Mnemonic: "Green is good, Yellow is warning, Red is losing data."

Core Concepts

1. Indices, Shards, and Replicas

An index is a collection of documents (like a database table). Each index is split into shards (for horizontal scaling) and each shard has replicas (for redundancy):

Index: app-logs-2024.01.15
  Primary Shard 0 (node-1)  →  Replica 0 (node-2)
  Primary Shard 1 (node-2)  →  Replica 1 (node-3)
  Primary Shard 2 (node-3)  →  Replica 2 (node-1)
# List indices with health, doc count, and size
curl -s "localhost:9200/_cat/indices?v&s=index"   # quote the URL: an unquoted & backgrounds the command

# Output:
# health status index                 pri rep docs.count store.size
# green  open   app-logs-2024.01.15   3   1   15234567   12.3gb
# yellow open   app-logs-2024.01.14   3   1   14100000   11.1gb

# Create an index with explicit settings
curl -X PUT localhost:9200/app-logs-2024.01.16 -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'
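Daily log indices like the ones above are usually named from the current date. A minimal sketch of generating today's index name in the app-logs-* pattern (the pattern is taken from the examples above; nothing here talks to a cluster):

```shell
# Build today's daily index name, e.g. app-logs-2024.01.16
index="app-logs-$(date +%Y.%m.%d)"
echo "$index"
```

Ingest tools such as Logstash and Filebeat do this date-stamping for you; the point is that each day's data lands in a fresh index, which makes retention a cheap index-level delete rather than a document-level one.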

Shard sizing rules:

  • Target 20-50GB per shard (larger shards are slow to recover)
  • Avoid thousands of tiny shards (each consumes heap memory)
  • number_of_replicas: 1 is standard (tolerates one node failure)
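The sizing rules above reduce to a back-of-envelope calculation. A sketch, assuming a hypothetical 120GB daily index volume and a 40GB per-shard target (both numbers are illustrative, not from a real cluster):

```shell
daily_gb=120    # assumed daily index volume in GB
target_gb=40    # mid-range of the 20-50GB per-shard guideline
# Ceiling division: enough primary shards that none exceeds the target
shards=$(( (daily_gb + target_gb - 1) / target_gb ))
echo "number_of_shards: $shards"
```

For 120GB/day this yields 3 primaries of ~40GB each, matching the explicit settings used in the index-creation example above.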

2. Cluster Health

The single most important check:

# Quick health check
curl -s localhost:9200/_cluster/health?pretty

# Output:
# {
#   "cluster_name": "production",
#   "status": "green",           <-- THE KEY FIELD
#   "number_of_nodes": 5,
#   "active_primary_shards": 150,
#   "active_shards": 300,
#   "unassigned_shards": 0       <-- non-zero = problem
# }
Status   Meaning                                             Action
green    All primary and replica shards allocated            None
yellow   All primaries allocated, some replicas missing      Check for down nodes, disk space
red      Some primary shards unallocated — data unavailable  Immediate investigation
# Find unassigned shards and why
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"

# Detailed allocation explanation
curl -s localhost:9200/_cluster/allocation/explain?pretty
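The status-to-action mapping from the table above can be scripted for triage. A minimal sketch that parses the status field out of a hard-coded sample response (in practice you would pipe in the live curl output instead):

```shell
# Sample _cluster/health response, hard-coded for illustration only
health='{"cluster_name":"production","status":"yellow","unassigned_shards":4}'

# Extract the status field with sed (a JSON tool like jq is sturdier)
status=$(printf '%s' "$health" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')

case "$status" in
  green)  echo "all shards allocated" ;;
  yellow) echo "replicas missing: check for down nodes and disk space" ;;
  red)    echo "primaries unassigned: investigate immediately" ;;
esac
```

A check like this is a common building block for cron-driven alerting when a full monitoring stack is not available.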

3. Node Roles

# List nodes with roles
curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,disk.used_percent,cpu"

# Output:
# name     node.role  heap.percent  disk.used_percent  cpu
# master-1 m          45            32                 12
# data-1   d          72            68                 35
# data-2   d          65            71                 28
# coord-1  -          30            15                  8
Role          Flag  Purpose
Master        m     Cluster state management, shard allocation
Data          d     Stores data, executes searches
Ingest        i     Pre-processes documents (pipelines)
Coordinating  -     Routes queries, aggregates results
ML            l     Machine learning jobs

Production clusters should separate master and data roles (dedicated masters).
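Role separation is configured per node via node.roles in elasticsearch.yml. A sketch that writes illustrative fragments to temporary files (the paths and role mixes here are assumptions for demonstration, not a prescription for your cluster):

```shell
# Dedicated master: holds no data, so cluster-state work is isolated
# from search and indexing load
cat > /tmp/master.yml <<'EOF'
node.roles: [ master ]
EOF

# Data node that also runs ingest pipelines
cat > /tmp/data.yml <<'EOF'
node.roles: [ data, ingest ]
EOF

grep -H 'node.roles' /tmp/master.yml /tmp/data.yml
```

An empty list (node.roles: [ ]) yields a coordinating-only node, which is why such nodes show "-" in the _cat/nodes output above.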

4. Mappings

Mappings define field types in an index (like a schema):

# View mapping for an index
curl -s localhost:9200/app-logs-2024.01.15/_mapping?pretty

# Common field types:
# keyword  — exact match, aggregations (hostname, status code)
# text     — full-text search, analyzed (log message)
# date     — timestamps
# long     — numeric
# boolean  — true/false
# ip       — IP addresses

# Create an index template (auto-apply mappings to new indices)
curl -X PUT localhost:9200/_index_template/app-logs -H 'Content-Type: application/json' -d '{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "message": { "type": "text" },
        "host": { "type": "keyword" },
        "duration_ms": { "type": "long" }
      }
    }
  }
}'

Under the hood: Elasticsearch's "shard" is actually a complete Lucene index. Each Lucene index consists of immutable "segments" — when new documents are indexed, they go into an in-memory buffer, then get flushed to a new segment. Segments are periodically merged in the background. This is why force-merging read-only indices (like old log indices) to a single segment saves heap and speeds queries — fewer segments means fewer file handles and less memory overhead per shard.

5. JVM Tuning

Elasticsearch runs on the JVM. Heap configuration is critical:

# Check JVM heap usage
curl -s localhost:9200/_nodes/stats/jvm?pretty | grep -A5 heap

# Key rules:
# - Set heap to at most 50% of RAM (Lucene relies on the OS filesystem cache for the rest)
# - Never exceed ~30.5GB (compressed oops threshold)
# - Set Xms and Xmx to the same value (avoids heap-resize pauses)

# /etc/elasticsearch/jvm.options
# -Xms16g
# -Xmx16g
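The heap rules above reduce to a few lines of arithmetic. A hedged sketch, assuming a hypothetical 64GB host:

```shell
ram_gb=64                             # assumed host RAM in GB
heap_gb=$(( ram_gb / 2 ))             # rule 1: at most half of RAM
if [ "$heap_gb" -gt 30 ]; then        # rule 2: stay below the ~30.5GB oops threshold
  heap_gb=30
fi
echo "-Xms${heap_gb}g"                # rule 3: Xms equals Xmx
echo "-Xmx${heap_gb}g"
```

On a 64GB host this caps heap at 30g rather than 32g, which is the point of the compressed-oops gotcha discussed later: the cap leaves more usable memory, not less.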

# Monitor GC pressure
curl -s localhost:9200/_nodes/stats/jvm?pretty | grep -A10 gc
Symptom                 Likely Cause                                              Fix
Heap > 85% constantly   Too many shards, too much data on node                    Add nodes, reduce shard count
Long GC pauses (> 1s)   Heap too large (> 30.5GB) or too many concurrent queries  Reduce heap below 31GB, add coordinating nodes
Circuit breaker trips   Single query using too much memory                        Add query size limits, optimize queries

6. Common Ops Tasks

Reindex — copy data between indices (mapping changes, shard count changes):

curl -X POST localhost:9200/_reindex -H 'Content-Type: application/json' -d '{
  "source": { "index": "app-logs-old" },
  "dest": { "index": "app-logs-new" }
}'

# Check reindex progress
curl -s "localhost:9200/_tasks?actions=*reindex&detailed&pretty"

Force merge — reduce segment count on read-only indices (saves heap and speeds queries):

# Only force-merge indices that are no longer being written to
curl -X POST localhost:9200/app-logs-2024.01.01/_forcemerge?max_num_segments=1

Snapshots — backup and restore:

# Register a snapshot repository
curl -X PUT localhost:9200/_snapshot/s3_backup -H 'Content-Type: application/json' -d '{
  "type": "s3",
  "settings": {
    "bucket": "es-backups",
    "region": "us-east-1"
  }
}'

# Take a snapshot
curl -X PUT "localhost:9200/_snapshot/s3_backup/snapshot_$(date +%Y%m%d)?wait_for_completion=false"

# List snapshots
curl -s localhost:9200/_snapshot/s3_backup/_all?pretty

# Restore a snapshot
curl -X POST localhost:9200/_snapshot/s3_backup/snapshot_20240115/_restore -H 'Content-Type: application/json' -d '{
  "indices": "app-logs-2024.01.15"
}'

Index Lifecycle Management (ILM) — automated retention:

# Create an ILM policy
curl -X PUT localhost:9200/_ilm/policy/log-retention -H 'Content-Type: application/json' -d '{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" }}},
      "warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }}},
      "cold": { "min_age": "30d", "actions": { "freeze": {} }},
      "delete": { "min_age": "90d", "actions": { "delete": {} }}
    }
  }
}'
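Note that min_age counts from rollover, so the policy above implies a per-index timeline after rollover. A small sketch that prints the phase boundaries from the min_age values in the example, useful for sanity-checking retention math before applying a policy:

```shell
# Phase start days taken from the example policy (rollover-relative)
for phase in hot:0 warm:7 cold:30 delete:90; do
  echo "${phase%%:*} phase begins at day ${phase##*:}"
done
```

Reading it back: an index is hot until rollover, shrunk at day 7, frozen at day 30, and deleted at day 90, giving roughly 90 days of total retention per rolled-over index.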

Gotcha: The 30.5 GB JVM heap limit (also called the "compressed oops threshold") is one of the most critical Elasticsearch tuning facts. Below ~30.5 GB, the JVM uses compressed ordinary object pointers (compressed oops) — 32-bit pointers that address up to 32 GB. Above this threshold, the JVM switches to 64-bit pointers, effectively wasting ~30% of heap on pointer overhead. A node with 32 GB heap can actually have LESS usable memory than one with 30 GB. Always set Xmx to 30g or lower.

Default trap: New Elasticsearch clusters default to 1 replica per index. On a single-node cluster, this means every index is permanently yellow because the replica cannot be allocated to the same node as the primary. This is harmless but causes alarm fatigue. Either add a second node or set number_of_replicas: 0 for single-node setups.

7. Debugging Checklist

Cluster is RED:
  1. curl localhost:9200/_cluster/health?pretty
  2. curl "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
  3. curl localhost:9200/_cluster/allocation/explain?pretty
  4. Check disk space on all data nodes: df -h
  5. Check node availability: curl localhost:9200/_cat/nodes?v

Indexing is slow:
  1. Check bulk queue: curl localhost:9200/_cat/thread_pool/write?v
  2. Check refresh interval (default 1s, can increase to 30s for bulk loads)
  3. Check merge activity: curl localhost:9200/_cat/segments?v
  4. Check disk I/O: iostat -x 1

Queries are slow:
  1. Enable slow query log
  2. Check heap pressure: curl localhost:9200/_nodes/stats/jvm?pretty
  3. Check shard count per index (too many small shards = overhead)
  4. Check coordinating node load
  5. Profile the query: POST /index/_search { "profile": true, ... }

Disk full:
  1. Delete old indices: curl -X DELETE localhost:9200/app-logs-2024.01.01
  2. Check ILM policy is running: curl localhost:9200/_ilm/status
  3. Reduce replicas temporarily: curl -X PUT localhost:9200/index/_settings -H 'Content-Type: application/json' -d '{"number_of_replicas":0}'
  4. Move shards off the full node with allocation rules

Key Takeaway

Elasticsearch ops centers on three things: cluster health (green/yellow/red — check it constantly), shard management (right size, right count, properly allocated), and JVM heap (stay under 85%, never exceed 30.5GB). Master the _cat and _cluster APIs, set up ILM for automated retention, and take regular snapshots. When things go wrong, start with _cluster/health and _cat/shards — they tell you exactly what is broken and where.


Wiki Navigation

  • Elasticsearch Flashcards (CLI) (flashcard_deck, L1) — Elasticsearch