Ceph Storage — Street-Level Ops¶
Quick Diagnosis Commands¶
# First look — always start here
ceph -s
ceph health detail
# What's the cluster doing right now?
ceph -w # live event stream, Ctrl-C to stop
watch -n5 ceph -s # refresh every 5 seconds
# OSD map — which are up/in/out
ceph osd tree
ceph osd stat # summary: X osds: Y up, Z in
# PG health
ceph pg stat
ceph pg dump_stuck # PGs that aren't progressing
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
ceph pg dump_stuck undersized
# Per-OSD utilization
ceph osd df tree # detailed with CRUSH tree
ceph osd perf # latency per OSD
# IO stats
ceph iostat # current cluster throughput/IOPS (requires: ceph mgr module enable iostat)
Gotcha: OSD Full — Cluster Goes Read-Only¶
Rule: When any OSD hits full_ratio (default 0.95), Ceph raises the full flag and blocks writes; the cluster is effectively read-only until space is freed. nearfull_ratio (default 0.85) warns you first.
# Check fill levels
ceph df detail
ceph osd df # per-OSD usage and fill ratio
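The per-OSD check above can be scripted into an alert-style list via the JSON output. A sketch, assuming the `nodes[].utilization` percentage field used by recent releases (verify the field names with `ceph osd df -f json | head`):

```shell
# Print OSDs above 85% utilization (tune the threshold to your nearfull_ratio)
ceph osd df -f json | python3 -c "
import json, sys
for n in json.load(sys.stdin)['nodes']:
    if n['utilization'] > 85:
        print(n['name'], round(n['utilization'], 1), '% used')
"
```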
# Emergency expansion options:
# 1. Increase the ratio temporarily (buys time, not a fix)
ceph osd set-full-ratio 0.97
ceph osd set-nearfull-ratio 0.87
# 2. Delete data to free space
# 3. Add OSDs
# 4. Remove a pool if it's non-critical
# Find the fattest objects (stats every object, so slow on large pools)
rados df # per-pool raw usage
rados -p mypool ls | while read -r obj; do
rados -p mypool stat "$obj"
done | sort -k5 -rn | head -20
# stat prints "name mtime <ts>, size <bytes>"; the size is field 5
Gotcha: Too Many OSDs Down — PGs Go Inactive¶
Rule: If enough OSDs go down simultaneously that a PG loses quorum (below min_size), that PG goes inactive and all I/O to it blocks. Do NOT blindly restart OSDs — figure out why they're down first.
# Identify down OSDs
ceph osd tree | grep down
# Check why an OSD is down — look at its logs
cephadm logs --name osd.<id> | tail -100
# or
journalctl -u ceph-osd@<id> | tail -100
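Before restarting anything, it helps to grep the log for the usual failure signatures. A non-exhaustive starter set (the patterns below are suggestions, not an official list):

```shell
# Scan the last hour of OSD logs for common causes of death
journalctl -u ceph-osd@<id> --since "1 hour ago" | \
  grep -Ei "abort|assert|i/o error|heartbeat_check|oom|enospc"
```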
# Determine if the disk is failed
smartctl -a /dev/sdX
lsblk
# If OSD process died and disk is healthy, restart it
ceph orch daemon restart osd.<id>
# or
systemctl start ceph-osd@<id>
# If disk is genuinely dead: mark out, wait for recovery, then remove
ceph osd out <id>
# wait for recovery to complete
watch -n10 ceph -s
# then: destroy the OSD
ceph osd destroy <id> --yes-i-really-mean-it
ceph osd rm <id>
ceph osd crush rm osd.<id>
ceph auth del osd.<id>
# or, on Luminous and later, one command replaces the four above:
ceph osd purge <id> --yes-i-really-mean-it
Gotcha: Stuck Peering¶
Under the hood: Peering is the process where OSDs agree on the authoritative copy of each object in a PG. It requires all OSDs in the PG's "up set" to communicate. If even one is unreachable, peering blocks until the OSD comes back or the admin intervenes.
Rule: PGs stuck in peering usually mean one or more OSDs involved in that PG are unreachable. The cluster is waiting for them to come back or for operator action.
# List PGs stuck in peering
ceph pg dump_stuck | grep peering
# Find which OSDs a PG maps to
ceph pg map <pgid> # e.g., ceph pg map 1.4a
# Bump recovery/backfill priority for specific PGs (safe: only reorders the queue)
ceph pg force-recovery <pgid>
ceph pg force-backfill <pgid>
# Recreate a PG stuck incomplete whose data is unrecoverable
# (extreme last resort: this accepts permanent data loss for that PG)
ceph osd force-create-pg <pgid> --yes-i-really-mean-it
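To see which OSDs a stuck PG is actually waiting on, the query output can be filtered. A sketch; the `recovery_state` keys such as `down_osds_we_would_probe` and `blocked` are assumptions drawn from common releases, so check the raw JSON on yours:

```shell
# Pull the peering hints out of a pg query (substitute a real pgid)
ceph pg <pgid> query | python3 -c "
import json, sys
q = json.load(sys.stdin)
for state in q.get('recovery_state', []):
    print(state.get('name', '?'))
    for key in ('blocked', 'blocked_by', 'down_osds_we_would_probe'):
        if key in state:
            print('  ', key, '=', state[key])
"
```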
Pattern: Adding New OSDs Safely¶
Scale note: Adding a single OSD to a 100-OSD cluster moves roughly 1% of data. Adding 10 at once moves ~10%. On a 100 TB cluster, that is 10 TB of network I/O competing with client traffic. Stagger additions and throttle recovery to protect production workloads.
Never add a large batch of OSDs at once on a loaded cluster — CRUSH rebalancing will saturate your network.
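The arithmetic in the scale note can be turned into a quick estimator. A sketch assuming uniform CRUSH weights and even data distribution; real movement varies with the map:

```shell
# moved ≈ used * new_osds / (existing_osds + new_osds)
estimate_rebalance() {
  local used_tb=$1 existing=$2 new=$3
  # bash lacks float math, so delegate the division to awk
  awk -v u="$used_tb" -v e="$existing" -v n="$new" \
    'BEGIN { printf "~%.1f TB will move\n", u * n / (e + n) }'
}
estimate_rebalance 100 100 10   # 100 TB used, 100 OSDs, adding 10: ~9.1 TB
```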
# 1. Add one or a few OSDs
ceph orch daemon add osd nodeX:/dev/sdX
# 2. Watch rebalancing
watch -n5 'ceph -s | grep -E "health|degraded|misplaced|recover"'
# 3. Throttle if needed (injectargs is runtime-only; on recent releases
#    "ceph config set osd osd_max_backfills 1" persists across restarts)
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 1'
# 4. After recovery completes (active+clean), add next batch
# 5. Re-enable full recovery speed after done
ceph tell 'osd.*' injectargs '--osd-max-backfills 3'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 3'
Pattern: Removing an OSD Safely¶
# 1. Mark out — CRUSH redistributes its PGs to other OSDs
ceph osd out <id>
# 2. Wait for data to migrate (can take minutes to hours depending on data)
watch -n10 ceph -s
# Wait until HEALTH_OK or only normal WARNs
# 3. Stop the daemon
ceph orch daemon stop osd.<id>
# or
systemctl stop ceph-osd@<id>
# 4. Remove from CRUSH, auth, and OSD map
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm <id>
# Verify
ceph osd tree
Scenario: Cluster Stuck at HEALTH_WARN — PGs Degraded After Node Reboot¶
Node rebooted, OSDs came back but cluster shows X pgs degraded.
# 1. Check overall state
ceph -s
# Typical output:
# health: HEALTH_WARN
# Degraded data redundancy: 142/1800 objects degraded (7.888%)
# 2. Confirm OSDs are back up
ceph osd tree | grep -E "down|out"
# 3. Check if recovery is progressing
watch -n5 'ceph -s | grep -E "recover|degrad|misplac"'
# Should show decreasing numbers
# 4. If recovery stalled, check for slow ops
ceph osd blocked-by
ceph daemon osd.<id> ops # in-progress ops
ceph daemon osd.<id> dump_ops_in_flight # same data under its full name
# 5. Speed up if recovery is too slow
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 8'
ceph tell 'osd.*' injectargs '--osd-max-backfills 4'
# 6. Typical resolution time: 10-60 min for small clusters; hours for large ones
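For step 3, instead of eyeballing the watch output, the degraded counter can be logged over time from the JSON status. A sketch; the `pgmap.degraded_objects` field name is an assumption from common releases, so verify against `ceph -s -f json`:

```shell
# Timestamped degraded-object count every 30 s; Ctrl-C to stop
while sleep 30; do
  ceph -s -f json | python3 -c "
import json, sys, time
pg = json.load(sys.stdin)['pgmap']
print(time.strftime('%H:%M:%S'), pg.get('degraded_objects', 0), 'objects degraded')
"
done
```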
Scenario: RBD Image Blocked — "rbd: sysfs write failed"¶
# Check if the OSD the image lives on is down
rbd status mypool/myimage # shows watchers
# Identify PGs for this image
rbd info mypool/myimage # shows object prefix
# Objects named: rbd_data.<prefix>.00000000000000XX
# Check PG state for one of those objects
ceph osd map mypool rbd_data.<prefix>.0000000000000001
# Returns: pgid, acting OSDs, state
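Rather than guessing at index 0000000000000001, the object backing a given byte offset can be computed from the prefix. A sketch assuming the default 4 MiB object size; confirm the real size from the `order` value in `rbd info`:

```shell
# object index = offset / object_size, rendered as a 16-digit hex suffix
obj_for_offset() {
  local prefix=$1 offset=$2 objsize=${3:-4194304}
  printf '%s.%016x\n' "$prefix" $((offset / objsize))
}
obj_for_offset rbd_data.abc123 8388608   # byte offset 8 MiB -> index 2
```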
# If the PG is inactive, find why those OSDs are down
ceph pg <pgid> query | python3 -m json.tool | grep -A5 "state"
Emergency: Monitor Quorum Lost¶
# Symptoms: ceph -s hangs (no quorum means no cluster map service),
# or HEALTH_WARN "1/3 mons down, quorum node1,node3" while quorum survives
# Check monitor status
ceph mon stat
ceph mon dump
# Restart a failed MON
ceph orch daemon restart mon.<hostname>
# or
systemctl start ceph-mon@<hostname>
# If MON lost its data store (rare), use monmaptool to recover
# This is a disaster recovery procedure — see ceph docs "Recovery from Monitor Failures"
# Check MON logs
cephadm logs --name mon.<hostname> | tail -200
journalctl -u ceph-mon@<hostname> | tail -200
Emergency: Clock Skew Causing HEALTH_WARN¶
Debug clue: Clock skew warnings often appear after VM live migration or host suspend/resume. The guest clock drifts while the host is paused. If you see this after maintenance windows, check whether VMs were migrated.
Monitors require clocks synchronized within mon_clock_drift_allowed (default 0.05 s) across all MON nodes; run chrony or NTP everywhere.
# Identify which node has clock skew
ceph health detail | grep "clock skew"
# Output: "clock skew detected on mon.node2"
# Fix on the offending node
ssh node2 systemctl restart chronyd
ssh node2 chronyc makestep # force immediate sync
ssh node2 chronyc tracking # verify sync status
# Verify skew resolved
ceph health detail
Useful One-Liners¶
# Count PGs by state (state is column 2 of pgs_brief; the "dumped" notice goes to stderr)
ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn
# Find objects in a specific pool
rados -p mypool ls | wc -l
# Check scrub errors
ceph health detail | grep -i inconsistent
rados list-inconsistent-pg mypool # inconsistent PGs in one pool
# List pools with their replica count and PG count (field positions shift
# between releases, so locate the values by keyword instead of column number)
ceph osd dump | awk '/^pool/ {s=p="?"; for (i=1;i<NF;i++) {if ($i=="size") s=$(i+1); if ($i=="pg_num") p=$(i+1)} print $3, "size="s, "pg_num="p}'
# OSD utilization sorted by fill %; check which column is %USE in your
# release (ceph osd df | head -1) and adjust the -k value to match
ceph osd df | sort -k7 -rn | head -20
# Show all client connections to OSDs
ceph daemon osd.0 sessions | head -30
# Identify the slowest OSDs (osd perf reports commit/apply latency in ms, not ops)
ceph osd perf | sort -k3 -rn | head -10
# Tail the cluster log
ceph log last 100
ceph -w | grep -v "^$"
# Rook: exec into the toolbox
kubectl -n rook-ceph exec -it \
$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') \
-- bash
# Check Rook operator reconcile loop
kubectl -n rook-ceph logs deploy/rook-ceph-operator --since=10m | grep -E "ERROR|WARN|reconcil"