Portal | Level: L2: Operations | Topics: Ceph Storage | Domain: DevOps & Tooling
Ceph Storage — Primer¶
Why This Matters¶
Name origin: Ceph is named after "cephalopod" (octopus, squid) -- creatures with many arms, representing the distributed nature of the storage system. It was created by Sage Weil as part of his PhD research at UC Santa Cruz (2004-2007). Each major release is named after a cephalopod species in alphabetical order: Argonaut, Bobtail, ..., Quincy, Reef, Squid.
Ceph is the dominant open-source distributed storage system used under Kubernetes (via Rook-Ceph), OpenStack, and bare-metal clusters. When something breaks — OSD flapping, stuck PGs, degraded data — you need to understand the internals to know what the cluster is actually doing and whether it is safe to proceed. Ceph failures range from cosmetic (HEALTH_WARN on one OSD) to catastrophic (too many OSDs down, PGs go inactive). Understanding the architecture is prerequisite to not making it worse.
Core Concepts¶
1. RADOS — The Foundation¶
Everything in Ceph sits on RADOS (Reliable Autonomic Distributed Object Store). RADOS is a flat namespace of objects stored across OSDs, with no filesystem hierarchy. All higher-level services (RBD, CephFS, RGW) are clients of RADOS.
┌──────────┬──────────┬──────────┐
│   RBD    │  CephFS  │   RGW    │   ← clients / interfaces
└────┬─────┴────┬─────┴────┬─────┘
     │          │          │
     └──────────┴──────────┘
                │
         RADOS (librados)
                │
     ┌──────────┴──────────┐
     │     OSD daemons     │   ← one per disk
     └─────────────────────┘
2. Daemon Roles¶
OSD (Object Storage Daemon) - One daemon per physical disk (or partition) - Stores objects, handles replication/erasure coding locally - Reports health to monitors, participates in peering - Uses BlueStore by default (direct writes to raw block device, no filesystem overhead)
Monitor (MON) - Maintains the authoritative cluster map: CRUSH map, OSD map, PG map, MDS map - Odd number recommended (3 or 5 for quorum; 1 for single-node dev only; an even count adds a failure source without adding fault tolerance) - Clients consult monitors to bootstrap, then cache maps locally - Election protocol: Paxos variant
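The quorum arithmetic behind the odd-count rule can be sketched in a few lines (toy calculation, not Ceph code):

```python
# Toy quorum arithmetic (not Ceph code): a MON quorum is a strict
# majority, so fault tolerance only improves at odd counts.
def quorum_size(mons: int) -> int:
    return mons // 2 + 1

def tolerated_failures(mons: int) -> int:
    return mons - quorum_size(mons)

for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 MONs tolerate 1 failure; 4 MONs still tolerate only 1 — the extra
# even MON adds a failure source without adding resilience.
```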
Manager (MGR) - Collects and exposes metrics, hosts the dashboard and REST API - Runs modules: balancer, prometheus exporter, pg_autoscaler, orchestrator - At least 2 recommended (active + standby)
MDS (Metadata Server) - Required only for CephFS - Manages namespace (directories, inodes), not data blocks - Data blocks stored in RADOS like everything else - Can run multiple active MDS for high-throughput metadata workloads
3. CRUSH Map and Failure Domains¶
CRUSH (Controlled Replication Under Scalable Hashing) is a deterministic algorithm that maps an object to a set of OSDs without a central lookup table. Every client computes OSD placement independently.
Under the hood: CRUSH is what makes Ceph fundamentally different from storage systems that use a central metadata server for data placement (like HDFS NameNode). Because CRUSH is deterministic and every client has the map, there is no single point of failure for data routing. The trade-off: changing the CRUSH map (adding/removing OSDs) triggers data movement, which can be slow and I/O-intensive.
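The placement idea can be illustrated with a toy stand-in: rendezvous (HRW) hashing. This is not the real CRUSH algorithm, but it demonstrates the same property — every client that holds the OSD list computes the same placement with no central lookup:

```python
import hashlib

# Toy stand-in for CRUSH: rendezvous (HRW) hashing. Not the real
# algorithm, but it shows the key property — placement is a pure
# function of (object name, OSD list), identical on every client.
def place(obj: str, osds: list[str], replicas: int = 3) -> list[str]:
    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{obj}:{osd}".encode()).hexdigest()
        return int(digest, 16)
    # highest-scoring OSDs win this object; no lookup table needed
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
print(place("rbd_data.abc.0000", osds))   # same answer on every client
```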
# View the CRUSH map
ceph osd crush tree --show-shadow
ceph osd crush dump
# Extract, decompile, edit, recompile
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt
crushtool -c crushmap.txt -o newcrush.bin
ceph osd setcrushmap -i newcrush.bin
Failure domains define what "separate" means. You configure them as CRUSH bucket types:
# Typical hierarchy
root → datacenter → rack → host → osd
# Default rule (host-level failure domain)
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
With 3-way replication and type host, each replica lands on a different host. With type rack, each replica lands in a different rack. If you only have 2 hosts but ask for type host with replication size 3, PGs will be undersized and stuck.
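The failure-domain constraint reduces to simple counting (a toy model, not Ceph code): placeable replicas are capped by how many distinct buckets of the chosen type exist:

```python
# Toy model of "step chooseleaf firstn 0 type host": each replica must
# land in a distinct bucket of the failure-domain type, so placements
# are capped by the number of such buckets.
def placeable_replicas(size: int, failure_domains: int) -> int:
    return min(size, failure_domains)

def is_undersized(size: int, failure_domains: int) -> bool:
    return placeable_replicas(size, failure_domains) < size

# 2 hosts cannot hold 3 host-separated replicas -> undersized PGs
print(is_undersized(size=3, failure_domains=2))
```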
4. Pool Types¶
Replicated pool
# Create a 3-replica pool
ceph osd pool create mypool replicated
ceph osd pool set mypool size 3 # total copies
ceph osd pool set mypool min_size 2 # min copies for I/O
Erasure-coded pool
# Create an erasure coding profile: 4 data + 2 parity (k=4, m=2)
ceph osd erasure-code-profile set myprofile k=4 m=2 plugin=jerasure technique=reed_sol_van
ceph osd pool create ec-pool erasure myprofile
# 4+2 tolerates 2 OSD failures, uses 1.5x raw space vs 3x for replication
# RBD on an EC data pool needs BlueStore with overwrites enabled, plus a
# replicated pool for image metadata (rbd create --data-pool ec-pool ...):
ceph osd pool set ec-pool allow_ec_overwrites true
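The space/durability trade-off in the comments above is plain arithmetic; a quick sketch (illustrative helper names, not a Ceph API):

```python
# Raw-space multiplier and fault tolerance for an EC profile (k data +
# m parity chunks), compared with n-way replication. Toy arithmetic,
# not a Ceph API.
def ec_overhead(k: int, m: int) -> float:
    return (k + m) / k        # raw bytes written per logical byte

def ec_tolerates(m: int) -> int:
    return m                  # any m chunks (OSDs) may be lost

print(ec_overhead(4, 2))      # 1.5, vs 3.0 for a size=3 replicated pool
print(ec_tolerates(2))        # 2 simultaneous OSD failures survivable
```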
PG (Placement Group) count
# Modern Ceph (Quincy+): use pg_autoscaler
ceph osd pool set mypool pg_autoscale_mode on
ceph osd pool autoscale-status
# Manual calculation: target_pgs_per_osd * osd_count / replica_count
# Rule of thumb: ~100 PGs per OSD. For 30 OSDs, 3-replica: 30*100/3 ≈ 1000 → round to 1024 (pg_num should be a power of two)
ceph osd pool set mypool pg_num 256
ceph osd pool set mypool pgp_num 256
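The manual calculation above, with the power-of-two rounding, as a small helper (illustrative only, not part of any Ceph tooling):

```python
# Rule-of-thumb PG sizing, rounded to the nearest power of two
# (pg_num should be a power of two). Illustrative helper only.
def pg_count(osds: int, replicas: int, target_per_osd: int = 100) -> int:
    raw = osds * target_per_osd / replicas
    power = 1
    while power * 2 <= raw:
        power *= 2
    # pick whichever neighboring power of two is closer to the raw value
    return power * 2 if (raw - power) > (power * 2 - raw) else power

print(pg_count(osds=30, replicas=3))   # 30*100/3 = 1000 -> 1024
```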
5. RBD — Block Device¶
RBD images are striped across RADOS objects (default 4 MiB per object). Used for VM disks, Kubernetes PVCs.
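How an image offset maps to a backing RADOS object can be sketched as follows, assuming the default 4 MiB object size (the object-name format mimics the rbd_data.&lt;id&gt;.&lt;index&gt; scheme but should be treated as illustrative):

```python
# Sketch of RBD striping: which backing RADOS object a given image
# offset lands in, assuming the default 4 MiB object size. The name
# format is illustrative, not an exact RADOS contract.
OBJ_SIZE = 4 * 1024 * 1024

def rbd_object(image_prefix: str, offset: int) -> tuple[str, int]:
    index = offset // OBJ_SIZE      # which object in the stripe
    within = offset % OBJ_SIZE      # byte offset inside that object
    return f"{image_prefix}.{index:016x}", within

print(rbd_object("rbd_data.abc123", 10 * 1024 * 1024))
# 10 MiB lands 2 MiB into object index 2
```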
# Create a pool and enable RBD application
ceph osd pool create rbd-pool replicated
ceph osd pool application enable rbd-pool rbd
# Create, map, format, mount
rbd create --size 20480 rbd-pool/myimage # 20 GiB
rbd map rbd-pool/myimage # returns /dev/rbdX
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt/rbd
# Snapshots
rbd snap create rbd-pool/myimage@snap1
rbd snap ls rbd-pool/myimage
rbd snap rollback rbd-pool/myimage@snap1
# Resize (online for ext4/xfs with kernel client)
rbd resize --size 40960 rbd-pool/myimage
resize2fs /dev/rbd0
# Image info
rbd info rbd-pool/myimage
rbd du rbd-pool/myimage # actual used space (thin provisioning)
6. CephFS — Filesystem¶
CephFS presents a POSIX filesystem backed by two RADOS pools: one for metadata, one for data.
# Create the filesystem from existing metadata and data pools
# (or use "ceph fs volume create myfs" to create pools automatically)
ceph fs new myfs cephfs_meta cephfs_data
# Check status
ceph fs status myfs
ceph mds stat
# Mount with kernel client
mount -t ceph mon1:6789,mon2:6789,mon3:6789:/ /mnt/cephfs \
-o name=admin,secret=$(ceph auth get-key client.admin)
# Mount with ceph-fuse
ceph-fuse -n client.admin /mnt/cephfs
# Create and manage sub-volumes (for Kubernetes)
ceph fs subvolume create myfs subvol1 --group_name group1
ceph fs subvolume getpath myfs subvol1 --group_name group1
7. RGW — S3/Swift Object Gateway¶
RGW provides an S3-compatible and Swift-compatible HTTP API on top of RADOS.
# Deploy via cephadm
ceph orch apply rgw myzone --placement=3
# Create a user
radosgw-admin user create --uid=s3user --display-name="S3 User" \
--access-key=AKIAIOSFODNN7EXAMPLE --secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Quota management
radosgw-admin quota set --quota-scope=user --uid=s3user --max-size=10G
radosgw-admin quota enable --quota-scope=user --uid=s3user
# Use with AWS CLI
aws --endpoint-url=http://rgw-host:7480 s3 ls
aws --endpoint-url=http://rgw-host:7480 s3 mb s3://mybucket
8. Deployment with cephadm¶
cephadm is the official deployment and bootstrap tool for Ceph Octopus (v15) and later. It manages daemons in containers.
# Bootstrap a single-node cluster
cephadm bootstrap --mon-ip 10.0.0.10 \
--initial-dashboard-user admin \
--initial-dashboard-password changeme
# Add hosts
ceph orch host add node2 10.0.0.11
ceph orch host add node3 10.0.0.12
# Deploy additional MONs and MGRs
ceph orch apply mon 3
ceph orch apply mgr 2
# Add OSDs — all available disks on all hosts
ceph orch apply osd --all-available-devices
# Or specific disks
ceph orch daemon add osd node2:/dev/sdb
ceph orch daemon add osd node2:/dev/sdc
# Check daemon status
ceph orch ls
ceph orch ps
9. Health States¶
# Cluster overview
ceph -s # compact status
ceph status # alias for ceph -s
ceph health # one-line health string
ceph health detail # full explanation
# OSD tree — shows weights, crush location, up/in state
ceph osd tree
# Pool usage
ceph df
ceph df detail
# PG summary
ceph pg stat
ceph pg dump summary
War story: A common Ceph disaster scenario: an admin marks too many OSDs out at once during maintenance, dropping below min_size for some PGs. Those PGs go inactive and all I/O to those pools freezes. The fix is to mark OSDs back in, but recovery takes time. Rule: never take out more OSDs per failure domain than size minus min_size allows. With size=3 and min_size=2, you can safely lose 1 OSD per failure domain.

Remember: Mnemonic for Ceph health: "OK is green, WARN is data-safe, ERR is data-at-risk." HEALTH_WARN means something is suboptimal but no data loss is imminent. HEALTH_ERR means act now or risk losing data.
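The maintenance rule above reduces to one subtraction (a toy helper, not a Ceph command):

```python
# The maintenance rule as arithmetic (toy helper, not a Ceph command):
# with `size` copies and `min_size` required for I/O, at most
# size - min_size OSDs may be lost per failure domain before PGs
# drop below min_size and go inactive.
def safe_to_take_out(size: int, min_size: int) -> int:
    return size - min_size

print(safe_to_take_out(size=3, min_size=2))  # prints 1
```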
HEALTH_OK: No issues.
HEALTH_WARN: Something is suboptimal but data is safe. Examples:
- 1 osds down — one OSD died, replicas serving reads
- Reduced data availability — some PGs degraded
- clock skew detected — NTP drift between nodes
- nearfull osd(s) — OSD >85% full (full at 95%)
HEALTH_ERR: Data availability or durability at risk. Examples:
- X pgs inactive — fewer live copies than min_size, I/O blocked
- X pgs incomplete — can't determine authoritative history, I/O blocked
- full osd(s) — OSD at the full ratio (default 95%), writes blocked
- mon quorum lost — cluster unresponsive
Note: degraded and undersized PGs on their own are HEALTH_WARN (fewer copies than size, but I/O still allowed); they become critical only once PGs drop below min_size and go inactive.
10. Common PG States¶
# Check PG states
ceph pg ls # per-PG listing including state
ceph pg dump_stuck # PGs stuck in non-active+clean states
ceph pg dump_stuck unclean
ceph pg dump_stuck inactive
| State | Meaning |
|---|---|
| active+clean | Normal. All replicas present, no operations pending. |
| active+degraded | Some objects have fewer replicas than target, but I/O allowed. |
| active+undersized | Acting set smaller than pool size (but at least min_size) — I/O still allowed. |
| peering | OSDs negotiating to agree on authoritative object state. |
| stale | Primary OSD hasn't reported PG status recently. |
| inactive | PG not active — I/O blocked. Usually means too many OSDs down. |
| incomplete | Cannot determine authoritative history — needs recovery or manual intervention. |
| backfilling | New OSD receiving objects it's responsible for. |
| recovering | Replicating objects after OSD returned from down state. |
| remapped | PG temporarily mapped to a different OSD set during CRUSH changes. |
11. OSD Recovery Tuning¶
During recovery, Ceph competes with client I/O. Tune to balance:
# Allow more recovery bandwidth (at the cost of client throughput)
ceph tell 'osd.*' injectargs '--osd-max-backfills 4'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
ceph tell 'osd.*' injectargs '--osd-recovery-sleep 0'
# Throttle recovery (protect client I/O during business hours)
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 1'
ceph tell 'osd.*' injectargs '--osd-recovery-sleep 0.1'
# Check recovery progress
ceph -s | grep -A5 "io:"
watch -n2 ceph -s
12. Rook-Ceph on Kubernetes¶
Rook is the Kubernetes operator that manages Ceph lifecycle.
# CephCluster CR
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
name: rook-ceph
namespace: rook-ceph
spec:
cephVersion:
image: quay.io/ceph/ceph:v18.2.0
dataDirHostPath: /var/lib/rook
mon:
count: 3
allowMultiplePerNode: false
mgr:
count: 2
storage:
useAllNodes: true
useAllDevices: false
deviceFilter: "^sd[b-z]" # only use /dev/sdb+ not /dev/sda (OS disk)
# Rook toolbox pod for ceph commands
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
ceph -s
# Storage class for RBD
kubectl get storageclass rook-ceph-block
# Storage class for CephFS
kubectl get storageclass rook-cephfs
# Check operator logs on issues
kubectl -n rook-ceph logs deploy/rook-ceph-operator -f
# CephBlockPool and StorageClass
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
name: replicapool
namespace: rook-ceph
spec:
failureDomain: host
replicated:
size: 3
requireSafeReplicaSize: true
Quick Reference¶
# Cluster health
ceph -s
ceph health detail
ceph osd tree
# Pool and usage
ceph df
ceph osd pool ls detail
ceph osd dump | grep full_ratio # default 0.95
# OSD operations
ceph osd out <id> # mark out — data migrates off (OSD stays in CRUSH)
ceph osd in <id> # mark back in
ceph osd down <id> # mark down (dangerous — causes peering)
ceph osd crush remove osd.<id>
ceph osd rm <id>
# PG operations
ceph pg repair <pgid>
ceph pg scrub <pgid>
ceph pg deep-scrub <pgid>
# Auth
ceph auth list
ceph auth get-key client.admin
ceph auth add client.myapp mon 'allow r' osd 'allow rw pool=mypool'
# Logs
ceph log last 50
ceph -w # live event stream
Wiki Navigation¶
Prerequisites¶
- Storage Operations (Topic Pack, L2)
- Kubernetes Ops (Production) (Topic Pack, L2)