
Ceph Storage Footguns


1. Removing an OSD before data migrates off it

You run ceph osd destroy <id> immediately after ceph osd out <id> without waiting for rebalancing to complete. The PGs that had their only surviving replica on this OSD are now lost.

Fix: After ceph osd out <id>, watch ceph -s until the cluster returns to HEALTH_OK (or at least until the degraded object count reaches 0), or ask Ceph directly with ceph osd safe-to-destroy osd.<id>. Only then destroy/remove the OSD. The wait time scales with how much data was on that OSD.
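The out, wait, destroy sequence can be scripted. This is a sketch, assuming OSD id 7; it relies on ceph osd safe-to-destroy, which exits non-zero as long as any PG still depends on the OSD:

```shell
# Mark the OSD out so Ceph starts migrating its PGs elsewhere.
ceph osd out 7

# Poll until Ceph confirms no PG depends on this OSD anymore.
until ceph osd safe-to-destroy osd.7; do
    sleep 60
done

# Only now is it safe to destroy the OSD.
ceph osd destroy 7 --yes-i-really-mean-it
```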


2. Setting min_size 1 to silence HEALTH_WARN

You have a degraded cluster and want to suppress the warning. You set ceph osd pool set mypool min_size 1. Now Ceph will serve reads and writes with only a single surviving replica — silently. If that OSD dies, data is gone.

Fix: Never lower min_size below (size / 2) + 1 in production. Fix the actual problem: add OSDs, repair the failed OSD, or scale down size in a controlled way.

Default trap: Some Ceph deployment guides suggest min_size 1 for small test clusters. If that config gets copy-pasted to production, a single OSD failure can cause silent data loss — Ceph will happily serve reads from the one remaining replica and accept writes to it, with zero redundancy. The HEALTH_WARN for degraded PGs is easy to miss in a noisy alert environment.
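The (size / 2) + 1 rule is simple integer arithmetic; a quick sketch (pool sizes here are just examples):

```python
def safe_min_size(size: int) -> int:
    """Lowest min_size that still guarantees a write quorum: floor(size/2) + 1."""
    return size // 2 + 1

# For the common 3-replica pool, never go below 2:
print(safe_min_size(3))  # 2
print(safe_min_size(5))  # 3
```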


3. Creating pools with too many PGs (or too few)

You create 10 pools each with 128 PGs on a 5-OSD cluster. That's 1280 PGs, and with 3-way replication each PG places a copy on 3 OSDs: 1280 * 3 / 5 = 768 PG replicas per OSD, far past Ceph's recommended ceiling of roughly 250 per OSD (mon_max_pg_per_osd defaults to 250). Memory usage per OSD balloons, recovery slows, and the monitors struggle.

Fix: Use the pg_autoscaler module (on by default for new pools since Pacific). For manual sizing, target roughly (OSDs * 100) / replica_count total PGs across all pools, rounded to a power of 2. Merge pools you don't need.

ceph osd pool set mypool pg_autoscale_mode on
ceph osd pool autoscale-status
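The manual formula is easy to sketch. A per-OSD target of 100 PGs is the commonly cited default, and rounding to the nearest power of 2 is one common convention:

```python
import math

def target_pg_count(num_osds: int, replica_count: int, per_osd_target: int = 100) -> int:
    """Total PGs: (OSDs * per-OSD target) / replicas, rounded to a power of 2."""
    raw = num_osds * per_osd_target / replica_count
    return 2 ** round(math.log2(raw))

print(target_pg_count(5, 3))   # 128  (raw ~166.7)
print(target_pg_count(30, 3))  # 1024 (raw 1000)
```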

4. Using a single CRUSH failure domain with 3-way replication

Your cluster has 3 hosts with 2 OSDs each. CRUSH rule says step chooseleaf firstn 0 type host. You add a 4th host and remove one old host. During the removal, 2 of your 3 hosts have OSD failures at the same time (one planned, one coincidental). PGs go inactive.

Fix: Ensure you always have at least replica_count CRUSH buckets at the failure domain level before removing any. ceph osd tree shows your bucket count at each level. With 3-way replication you need at least 3 live hosts in the CRUSH tree at all times.
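Before touching any host, a quick sanity check along these lines helps (this sketch assumes `host` is your failure domain and the default ceph osd tree output layout):

```shell
# Count host buckets currently in the CRUSH tree.
ceph osd tree | grep -c ' host '

# Inspect which failure domain your CRUSH rules actually use.
ceph osd crush rule dump | grep -B2 -A2 chooseleaf
```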


5. Rook: using useAllDevices: true without understanding what "all" means

You set useAllDevices: true in your CephCluster CR. Rook creates OSDs on every eligible (empty, unformatted) block device it discovers on every node, including blank disks you had earmarked for other workloads. Once ceph-volume claims a device, getting it back means purging the OSD and wiping the disk.

Fix: Always use deviceFilter or an explicit device list:

storage:
  useAllDevices: false
  deviceFilter: "^sd[b-z]"   # skip sda
Or enumerate devices explicitly per node using the nodes key.
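A per-node enumeration might look like this (node names and device names are placeholders; adapt to your inventory):

```yaml
storage:
  useAllDevices: false
  nodes:
    - name: "node-a"
      devices:
        - name: "sdb"
        - name: "sdc"
    - name: "node-b"
      devices:
        - name: "nvme0n1"
```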


6. Ignoring nearfull warnings

You see HEALTH_WARN: nearfull osd(s) and dismiss it as cosmetic. Two weeks later, one OSD crosses the full_ratio and the cluster halts all writes. Your application starts erroring.

Fix: Treat nearfull as a P2 incident. Add capacity, reduce data, or clean up stale snapshots before hitting the full threshold. Set up alerts on ceph health detail output or the Prometheus metrics: ceph_osd_stat_bytes_used / ceph_osd_stat_bytes > 0.80.

Under the hood: Ceph's default thresholds are nearfull_ratio: 0.85, backfillfull_ratio: 0.90, full_ratio: 0.95. At backfillfull, Ceph stops backfilling data to that OSD (recovery stalls). At full, all writes are blocked cluster-wide — not just to the full OSD. A single full OSD can freeze the entire cluster because PGs that include that OSD cannot accept writes.
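The threshold ladder above can be expressed as a small classifier, useful as the core of an alerting check (the ratios are the Ceph defaults; a real pipeline would feed in the ceph_osd_stat_* metrics):

```python
def osd_fullness_state(used_bytes: int, total_bytes: int,
                       nearfull: float = 0.85,
                       backfillfull: float = 0.90,
                       full: float = 0.95) -> str:
    """Classify an OSD against Ceph's default fullness thresholds."""
    ratio = used_bytes / total_bytes
    if ratio >= full:
        return "full"          # cluster-wide write block
    if ratio >= backfillfull:
        return "backfillfull"  # backfill to this OSD stalls
    if ratio >= nearfull:
        return "nearfull"      # HEALTH_WARN: act now
    return "ok"

print(osd_fullness_state(860, 1000))  # nearfull
```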


7. Running ceph osd pause and forgetting

You pause OSD I/O for maintenance with ceph osd pause. You forget to un-pause. The cluster stops serving all data. All clients block.

Fix: Always run ceph osd unpause immediately after maintenance, and put it on a checklist. Better: use the noout/norecover flags instead, which are safer and more targeted:

ceph osd set noout       # prevent OSDs from being marked out during maintenance
ceph osd unset noout     # re-enable after done


8. Changing the CRUSH map on a loaded cluster without incremental rebalancing

You restructure the CRUSH hierarchy (e.g., rename buckets or change failure domain type). Ceph immediately remaps every PG to new OSD sets and begins moving enormous amounts of data. Network and OSD I/O saturate. Client latency spikes to seconds.

Fix: Use reweight-by-utilization for gradual rebalancing. For CRUSH changes, apply them in small increments and monitor recovery between each step. Enable the balancer module instead of manual CRUSH edits where possible:

ceph mgr module enable balancer
ceph balancer on
ceph balancer mode upmap
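If a CRUSH change does trigger a large remap, recovery can be throttled so client I/O keeps breathing. A sketch using standard config keys (safe values depend on your hardware; recent releases already default osd_max_backfills low):

```shell
# Limit concurrent backfill operations per OSD.
ceph config set osd osd_max_backfills 1

# Limit concurrent recovery ops per OSD.
ceph config set osd osd_recovery_max_active 1

# Watch recovery progress between steps.
ceph -s
```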


9. Not separating OSD journals/WAL/DB from spinning disks

You deploy BlueStore OSDs on spinning HDDs but put the RocksDB WAL and DB on the same HDD. Write amplification from RocksDB compactions on the same spindle creates severe latency spikes.

Fix: Use an NVMe SSD for the BlueStore DB and WAL (bluestore_block_db_path, bluestore_block_wal_path). Rule of thumb: 1 NVMe can back the DB/WAL for 4-6 HDDs. In cephadm:

service_type: osd
service_id: hdd_with_nvme_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    paths:
      - /dev/sdb
  db_devices:
    paths:
      - /dev/nvme0n1


10. Trusting ceph -s HEALTH_OK while in noout mode

You set ceph osd set noout before a maintenance window. An OSD goes down while noout is active. Ceph shows HEALTH_WARN: noout flag(s) set but does NOT show degraded PGs because OSDs aren't being marked out. You assume the cluster is healthy and forget to unset noout. The down OSD stays down for days. When noout is finally cleared, Ceph marks it out and starts a huge rebalance.

Fix: When noout is set, check ceph osd tree | grep down explicitly. Add monitoring for ceph_osd_up == 0 separate from cluster health checks. Unset noout immediately after maintenance.
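A down-OSD check that doesn't depend on overall cluster health can parse ceph osd tree -f json. The JSON shape sketched here matches recent releases, but verify it against your version:

```python
import json

def down_osds(osd_tree_json: str) -> list[str]:
    """Return names of OSDs reported down in `ceph osd tree -f json` output."""
    tree = json.loads(osd_tree_json)
    return [node["name"]
            for node in tree.get("nodes", [])
            if node.get("type") == "osd" and node.get("status") == "down"]

# Illustrative output shape:
sample = '''{"nodes": [
  {"id": -1, "name": "default", "type": "root"},
  {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
  {"id": 1, "name": "osd.1", "type": "osd", "status": "down"}
]}'''
print(down_osds(sample))  # ['osd.1']
```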


11. Scrub/deep-scrub running during peak hours and killing latency

Deep scrubs are I/O intensive. By default Ceph can run them any time. On a busy cluster, an unexpected deep scrub causes P99 latency to spike 10x.

Fix: Set scrub hours to your maintenance window:

ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6
ceph config set osd osd_deep_scrub_interval 604800   # 7 days in seconds
And monitor scrub activity: ceph pg dump | grep scrubbing.


12. Deleting a pool without mon_allow_pool_delete

You try ceph osd pool delete mypool mypool --yes-i-really-really-mean-it and get an error. You Google it and find you need to set mon_allow_pool_delete = true in ceph.conf. You set it permanently. Now any admin can delete any pool at any time with no additional guard.

Fix: Set it temporarily for the delete operation, then unset:

ceph config set mon mon_allow_pool_delete true
ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
ceph config set mon mon_allow_pool_delete false


13. Placing the Rook operator and OSDs in the same namespace without RBAC review

You deploy Rook in the default namespace for simplicity. The operator's service account has cluster-admin. Any pod that can exec into the operator can take over the cluster.

Fix: Always deploy Rook in its own namespace (rook-ceph). Review the RBAC: the operator needs broad permissions, but limit which namespaces can create CephCluster CRs. Use PodSecurityAdmission to restrict the operator pod itself.