Redis Operations - Street-Level Ops

What experienced Redis operators know that tutorials don't teach.

Quick Diagnosis Commands

# Connect to Redis CLI
redis-cli                           # local, default port 6379
redis-cli -h redis.internal -p 6379
redis-cli -h redis.internal -p 6379 -a <password>
redis-cli -h redis.internal -p 6379 --no-auth-warning -a <password>

# Basic health check
redis-cli ping                      # → PONG
redis-cli -h redis.internal ping

# Server info (everything)
redis-cli info
redis-cli info server               # version, uptime, OS
redis-cli info clients              # connected clients, blocked, tracking
redis-cli info memory               # memory usage, fragmentation, eviction stats
redis-cli info stats                # commands processed, keyspace hits/misses
redis-cli info replication          # role, master/replica info, replication lag
redis-cli info keyspace             # keys per database, expires
redis-cli info persistence          # RDB/AOF status, last save time

# Memory analysis
redis-cli memory doctor             # automated memory health check
redis-cli memory usage mykey        # size in bytes of a specific key
redis-cli memory usage mykey SAMPLES 0  # exact size (slower)
redis-cli memory stats              # detailed memory breakdown

# Slow log — queries that exceeded slowlog-log-slower-than
redis-cli slowlog get 25            # last 25 slow commands
redis-cli slowlog len               # total slow log entries
redis-cli slowlog reset

# Monitor — live command stream (WARNING: CPU intensive, dev only)
redis-cli monitor

# Latency diagnosis
redis-cli --latency                 # measure latency in ms (Ctrl-C to stop)
redis-cli --latency-history         # latency samples over time
redis-cli --latency-dist            # latency distribution histogram
redis-cli debug sleep 0             # round-trip baseline (DEBUG SLEEP blocks the server for N seconds; 0 is effectively a no-op)

# Key scanning (NEVER USE KEYS IN PRODUCTION)
redis-cli scan 0 COUNT 100          # returns (cursor, [keys]) — iterate with cursor
redis-cli scan 0 MATCH "session:*" COUNT 100
redis-cli scan 0 TYPE string COUNT 100

# Count keys matching pattern safely
redis-cli --scan --pattern "session:*" | wc -l

# Database info
redis-cli dbsize                    # total key count (current db)
redis-cli info keyspace             # key count per database

# Client list
redis-cli client list               # all connected clients
redis-cli client list TYPE normal   # filter by type
redis-cli client getname            # name of current connection
redis-cli client kill ID <id>       # kill specific client

Common Scenarios

Scenario 1: Redis Memory Approaching Limit

Redis memory usage is at 90% of maxmemory. Application is seeing cache misses or write errors.

Diagnosis:

redis-cli info memory
# Check:
# used_memory_human — actual data
# used_memory_rss_human — OS-level memory (includes fragmentation)
# mem_fragmentation_ratio — > 1.5 means high fragmentation; < 1 means using swap
# maxmemory_policy — how Redis evicts (or errors) when full
# evicted_keys — from info stats; nonzero means eviction is happening
# maxmemory — current limit (0 = no limit)

redis-cli memory doctor
# Reports: fragmentation, big key warnings, AOF buffer, etc.

# Find your biggest keys
redis-cli --bigkeys              # walks the keyspace with SCAN, finds largest per type
# Outputs: the single biggest key of each data type, plus summary stats

# Find keys that should have expired
redis-cli info stats | grep expired_keys   # total expired
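The checks above can be scripted. A minimal sketch: parse the `key:value` lines that INFO emits and flag the same conditions. The field names are real INFO fields; the thresholds and warning strings are illustrative.

```python
# Sketch: parse "INFO memory" text (redis-cli output) and flag the
# conditions listed above. Thresholds are illustrative, not official.

def parse_info(text):
    """Turn 'key:value' INFO lines into a dict, skipping comments/blanks."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields

def memory_warnings(info):
    warnings = []
    frag = float(info.get("mem_fragmentation_ratio", 1.0))
    if frag > 1.5:
        warnings.append("high fragmentation (consider activedefrag)")
    elif frag < 1.0:
        warnings.append("RSS below used_memory; likely swapping")
    maxmemory = int(info.get("maxmemory", 0))
    used = int(info.get("used_memory", 0))
    if maxmemory and used / maxmemory > 0.9:
        warnings.append("above 90% of maxmemory")
    return warnings

sample = """# Memory
used_memory:964689920
maxmemory:1073741824
mem_fragmentation_ratio:1.73
"""
print(memory_warnings(parse_info(sample)))  # → ['high fragmentation (consider activedefrag)']
```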

Fix:

# Check current eviction policy
redis-cli config get maxmemory-policy

# Set eviction policy (choose based on workload)
redis-cli config set maxmemory-policy allkeys-lru    # evict LRU regardless of TTL
redis-cli config set maxmemory-policy volatile-lru   # evict only keys with TTL
redis-cli config set maxmemory-policy volatile-ttl   # evict soonest-to-expire first
redis-cli config set maxmemory-policy noeviction     # return errors when full (the default)

# Increase memory limit (if headroom exists)
redis-cli config set maxmemory 4gb

# Defragment memory (Redis 4.0+, use during low traffic)
redis-cli config set activedefrag yes
redis-cli memory purge              # ask the allocator (jemalloc) to release unused pages to the OS

# Delete large/stale keys
redis-cli del bigkey                # immediate delete
redis-cli unlink bigkey             # async delete (better for large keys)

# Scan and delete keys by pattern (do NOT use KEYS + DEL in production)
redis-cli --scan --pattern "tmp:*" | xargs -n 100 redis-cli unlink
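The same pattern-delete can live in application code: SCAN in pages, UNLINK in batches. A minimal sketch; `FakeRedis` below is a stand-in for a real client (e.g. redis-py) so the example is self-contained, and its single-page SCAN is a simplification.

```python
# Sketch: batched pattern delete using SCAN + UNLINK.
# FakeRedis is an in-memory stand-in for a real client, for illustration only.
import fnmatch

class FakeRedis:
    def __init__(self, keys):
        self.keys = set(keys)
    def scan(self, cursor=0, match="*", count=100):
        # A real SCAN pages through the keyspace; this fake returns one page.
        return 0, [k for k in sorted(self.keys) if fnmatch.fnmatch(k, match)]
    def unlink(self, *keys):
        self.keys -= set(keys)
        return len(keys)

def delete_by_pattern(r, pattern, batch=100):
    deleted = 0
    cursor = 0
    while True:
        cursor, keys = r.scan(cursor=cursor, match=pattern, count=batch)
        for i in range(0, len(keys), batch):
            deleted += r.unlink(*keys[i:i + batch])  # async delete, non-blocking
        if cursor == 0:
            break
    return deleted

r = FakeRedis(["tmp:1", "tmp:2", "user:1"])
print(delete_by_pattern(r, "tmp:*"))  # → 2
```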

Scenario 2: Replication Lag or Replica Not Syncing

A Redis replica is falling behind or shows errors in replication.

Diagnosis:

# On replica
redis-cli info replication
# Check:
# role: slave
# master_host, master_port — correct?
# master_link_status: up/down
# master_last_io_seconds_ago — time since last data from master
# master_sync_in_progress: 1 if full sync in progress
# slave_repl_offset vs master_repl_offset — gap is lag in bytes

# On primary
redis-cli info replication
# connected_slaves — should show replicas
# slave0: ip=...,port=...,state=online,offset=...,lag=0

# Check for replication errors in the server log
tail -f /var/log/redis/redis-server.log
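The offset gap from the diagnosis above is the lag in bytes. A small sketch computing it from the two INFO outputs; field names match INFO replication, the parsing helper is illustrative.

```python
# Sketch: replication lag in bytes = master_repl_offset - slave_repl_offset,
# taken from "INFO replication" on the primary and the replica.

def info_fields(text):
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition(":")
            out[key] = value
    return out

def lag_bytes(master_info, replica_info):
    master_offset = int(info_fields(master_info)["master_repl_offset"])
    replica_offset = int(info_fields(replica_info)["slave_repl_offset"])
    return master_offset - replica_offset

master = "role:master\nmaster_repl_offset:100200\n"
replica = "role:slave\nmaster_link_status:up\nslave_repl_offset:99800\n"
print(lag_bytes(master, replica))  # → 400
```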

Fix:

# If replica is permanently disconnected, force a full resync
redis-cli replicaof NO ONE           # detach from master
redis-cli replicaof <master-ip> 6379  # reattach (triggers PSYNC or full SYNC)

# If partial resync fails (replica offset too far behind):
# Increase repl-backlog-size on primary (allows partial resync window)
redis-cli config set repl-backlog-size 256mb

# On the replica, a forced full resync (replicaof above) flushes and
# rewrites the replica's dataset automatically; no manual flush is needed

# Check replication buffer
redis-cli client list TYPE replica   # per-replica output buffer (obl/oll/omem fields)

Scenario 3: KEYS Command Blocking Production

An engineer ran redis-cli keys "*" or an application called KEYS in production. Redis is single-threaded for command execution. KEYS on a 10M key database takes seconds, blocking all other commands. Applications see timeouts across the board.

Diagnosis:

# Check if KEYS is in the slow log
redis-cli slowlog get 10 | grep -A5 KEYS

# See current blocked clients
redis-cli info clients | grep blocked_clients

# Monitor for KEYS usage
redis-cli monitor 2>/dev/null | grep " KEYS "  # dangerous in production — sample briefly

Fix:

# Short-term: if a long KEYS is running, identify and kill the client
redis-cli client list | grep -i "cmd=keys"   # find the culprit client
redis-cli client kill ID <id>

# Replace all KEYS usage with SCAN:
# SCAN is O(1) per call, paginated, safe in production
# Application pattern (Python, assuming a connected redis-py client `r`):
cursor = 0
while True:
    cursor, keys = r.scan(cursor=cursor, match="session:*", count=100)
    process(keys)
    if cursor == 0:
        break

# On the CLI:
redis-cli --scan --pattern "session:*"   # iterates with SCAN under the hood

# Rename or disable dangerous commands via redis.conf
rename-command KEYS ""           # disable entirely
rename-command FLUSHDB "FLUSHDB_RENAMED_abc123"   # rename to obscure name

Scenario 4: Persistence Failure (RDB Save Failing)

Redis logs show "BGSAVE failed: No space left on device" or RDB save keeps failing, leaving persistence disabled.

Diagnosis:

redis-cli info persistence
# Check:
# rdb_bgsave_in_progress: 1 if saving now
# rdb_last_bgsave_status: ok or err
# rdb_last_bgsave_time_sec: how long the last save took
# rdb_last_cow_size: copy-on-write memory used during last save
# aof_enabled: 0 or 1
# aof_last_bgrewrite_status: ok or err

# Check disk space
df -h /var/lib/redis

Fix:

# Check if save is configured
redis-cli config get save     # e.g., "3600 1 300 100 60 10000"

# Disable automatic saves if disk is full (emergency)
redis-cli config set save ""

# Trigger a manual save
redis-cli bgsave               # forks and saves in the background (returns immediately)
redis-cli save                 # foreground save; blocks the server (avoid on large datasets)
redis-cli lastsave             # Unix timestamp of the last successful RDB save

# For AOF issues
redis-cli bgrewriteaof        # compact the AOF file
redis-cli config set appendonly no   # disable AOF temporarily if rewrite is failing

# If RDB is consistently failing because fork() cannot allocate memory:
# allow memory overcommit on the OS (recommended by the Redis docs)
echo 1 | sudo tee /proc/sys/vm/overcommit_memory
# Make permanent: echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
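The diagnosis fields above lend themselves to a scripted health check. A sketch; the field names are real INFO persistence fields, the decision logic and messages are illustrative.

```python
# Sketch: decide from "INFO persistence" fields whether the last RDB save
# and AOF rewrite succeeded.

def persistence_problems(fields):
    problems = []
    if fields.get("rdb_last_bgsave_status") != "ok":
        problems.append("last BGSAVE failed: check disk space and logs")
    if fields.get("aof_enabled") == "1" and \
       fields.get("aof_last_bgrewrite_status") != "ok":
        problems.append("last AOF rewrite failed")
    return problems

fields = {"rdb_last_bgsave_status": "err", "aof_enabled": "0"}
print(persistence_problems(fields))  # → ['last BGSAVE failed: check disk space and logs']
```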

Key Patterns

RDB vs AOF Persistence

# RDB (point-in-time snapshots)
# redis.conf settings:
save 3600 1     # save if 1+ keys changed in the last 3600s
save 300 100    # save if 100+ keys changed in the last 300s
save 60 10000   # save if 10000+ keys changed in the last 60s
dbfilename dump.rdb
dir /var/lib/redis

# AOF (append-only log — better durability)
appendonly yes
appendfsync everysec   # sync to disk every second (balance of safety/perf)
# appendfsync always   # sync every write (safest, slowest)
# appendfsync no       # OS decides (fastest, can lose data)
auto-aof-rewrite-percentage 100   # rewrite when AOF doubles in size
auto-aof-rewrite-min-size 64mb

# Both (RDB + AOF) — recommended for production
# RDB for fast restarts, AOF for durability
appendonly yes
save 3600 1
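The two auto-rewrite settings combine into one rule: rewrite when the AOF has grown by the configured percentage since the last rewrite AND exceeds the minimum size. A sketch of that rule, mirroring the documented semantics; the function itself is illustrative.

```python
# Sketch of the auto-AOF-rewrite trigger: with percentage=100 (the default),
# rewrite once the file has at least doubled since the last rewrite,
# but never for files below auto-aof-rewrite-min-size.

def should_rewrite_aof(current_size, base_size, pct=100, min_size=64 * 1024 * 1024):
    if current_size < min_size:
        return False          # too small to be worth rewriting
    if base_size == 0:
        return True           # no previous rewrite recorded
    growth_pct = (current_size - base_size) * 100 // base_size
    return growth_pct >= pct

mb = 1024 * 1024
print(should_rewrite_aof(130 * mb, 64 * mb))   # → True  (103% growth)
print(should_rewrite_aof(80 * mb, 64 * mb))    # → False (25% growth)
print(should_rewrite_aof(30 * mb, 16 * mb))    # → False (below min-size)
```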

Eviction Policies Reference

Remember: allkeys-lru is the safe default for caches. noeviction is the safe default for data you cannot afford to lose (sessions, queues). If you are unsure, start with allkeys-lru: the worst case is a cache miss, not an application error.

noeviction      — return error when maxmemory reached (use for sessions, not caches)
allkeys-lru     — evict LRU key from all keys (recommended for pure caches)
volatile-lru    — evict LRU key only from keys with TTL set
allkeys-lfu     — evict least frequently used from all keys (Redis 4.0+)
volatile-lfu    — evict LFU from keys with TTL
allkeys-random  — evict random key from all keys
volatile-random — evict random key from keys with TTL
volatile-ttl    — evict the key with nearest expiry time
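To make the allkeys-lru semantics concrete, here is an exact LRU cache. Note the caveat: Redis itself uses approximated LRU, sampling maxmemory-samples random keys and evicting the best candidate, rather than keeping a global LRU list; this sketch only illustrates the eviction idea.

```python
# Illustration of LRU eviction (exact version; Redis approximates this
# by sampling maxmemory-samples keys instead of tracking global order).
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()          # insertion order = recency order
    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]
    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

c = LRUCache(2)
c.set("a", 1); c.set("b", 2)
c.get("a")                  # touch "a" so "b" becomes the LRU key
c.set("c", 3)               # evicts "b"
print(list(c.data))         # → ['a', 'c']
```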

Cluster Diagnosis

# Cluster health
redis-cli cluster info
# cluster_state: ok (or fail)
# cluster_slots_assigned: should be 16384
# cluster_known_nodes: total nodes
# cluster_size: number of master nodes
# (cluster_enabled: 1 is reported by "redis-cli info cluster", not CLUSTER INFO)

# Node topology
redis-cli cluster nodes           # all nodes, their roles, slots
redis-cli cluster myid            # this node's ID

# Check slot assignment
redis-cli cluster info | grep slots
redis-cli --cluster check <any-node-host>:<port>

# Find which node owns a key
redis-cli cluster keyslot mykey   # returns slot number (0-16383)
redis-cli -c -h <any-node> get mykey   # -c enables cluster redirect

# Reshard slots (manual rebalancing)
redis-cli --cluster rebalance <any-node>:<port> --cluster-use-empty-masters

# Add a new node
redis-cli --cluster add-node <new-node-ip>:<port> <existing-node-ip>:<port>

# Make the current node a replica of a master (run on the node being demoted)
redis-cli cluster replicate <master-node-id>
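CLUSTER KEYSLOT can be reproduced locally: CRC16 (XMODEM variant, polynomial 0x1021) of the key, modulo 16384, with the hash-tag rule that a non-empty "{...}" section replaces the whole key for hashing. That rule is what keeps keys like {user:1}.cart and {user:1}.profile on the same node. A sketch of both pieces:

```python
# Sketch: slot calculation used by Redis Cluster.
# CRC16/XMODEM: poly 0x1021, init 0x0000, MSB-first, no final XOR.

def crc16(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def keyslot(key: str) -> int:
    # Hash-tag rule: first "{", then first "}" after it; if the content
    # between them is non-empty, only that content is hashed.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

print(keyslot("{user:1}.cart") == keyslot("{user:1}.profile"))  # → True
```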

Keyspace Notifications

# Enable notifications (redis.conf or runtime)
redis-cli config set notify-keyspace-events "KEA"
# K = keyspace events (published on __keyspace@<db>__:<key>)
# E = keyevent events (published on __keyevent@<db>__:<event>)
# A = alias for all event classes (g, $, l, s, h, z, x, e, t, ...)
# x = expired events only: "Kx"

# Subscribe to expiry events (in a separate client)
redis-cli psubscribe "__keyevent@0__:expired"

# Subscribe to all keyspace events on DB 0
redis-cli psubscribe "__keyspace@0__:*"

# Useful combinations:
notify-keyspace-events "Kx"   # only expired key events (low overhead)
notify-keyspace-events "Kg"   # generic commands (del, expire, rename)
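One documented gotcha: if the flag string contains neither K nor E, no events are delivered at all. A small sketch validating a flag string before applying it; the set of class characters reflects recent Redis versions and the messages are illustrative.

```python
# Sketch: sanity-check a notify-keyspace-events flag string.
# Per the Redis docs, at least one of K (keyspace) or E (keyevent)
# must be present or no notifications are delivered.

VALID_FLAGS = set("KEg$lshzxetdmnA")   # event classes in recent Redis versions

def validate_notify_flags(flags: str):
    unknown = set(flags) - VALID_FLAGS
    if unknown:
        return False, "unknown flags: " + "".join(sorted(unknown))
    if not ({"K", "E"} & set(flags)):
        return False, "need K and/or E, otherwise no events are delivered"
    return True, "ok"

print(validate_notify_flags("Kx"))   # → (True, 'ok')
print(validate_notify_flags("x")[0]) # → False (no K/E)
```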

Sentinel (High Availability without Cluster)

# Check sentinel status
redis-cli -p 26379 sentinel masters
redis-cli -p 26379 sentinel replicas mymaster   # "sentinel slaves" before Redis 5.0
redis-cli -p 26379 sentinel sentinels mymaster

# Failover state
redis-cli -p 26379 sentinel ckquorum mymaster
# → OK 2 usable Sentinels. Quorum and failover authorization can be reached

# Force a manual failover (for maintenance)
redis-cli -p 26379 sentinel failover mymaster

# Sentinel conf
sentinel monitor mymaster 10.0.0.1 6379 2   # 2 = quorum
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel auth-pass mymaster <password>
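The quorum number above is often misread: quorum only controls how many Sentinels must agree the master is down (ODOWN), while actually authorizing a failover always requires a majority of all known Sentinels. A simplified sketch of the two thresholds (it conflates "reachable" with "agreeing", which real Sentinel tracks separately):

```python
# Sketch: Sentinel's two thresholds. quorum gates the ODOWN flag;
# failover leader election needs a strict majority of all Sentinels.

def can_failover(total_sentinels, reachable, quorum):
    majority = total_sentinels // 2 + 1
    odown = reachable >= quorum          # enough agreement to flag ODOWN
    authorized = reachable >= majority   # enough votes to elect a leader
    return odown and authorized

print(can_failover(3, 2, 2))   # → True  (2-of-3 is both quorum and majority)
print(can_failover(5, 2, 2))   # → False (quorum met, but majority of 5 is 3)
```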

Connection and Performance Tuning

Default trap: tcp-backlog defaults to 511, but the OS enforces net.core.somaxconn as the ceiling. If somaxconn is 128 (common default), Redis silently clamps the backlog to 128 and you get connection drops under load. Always tune both: sysctl -w net.core.somaxconn=65535 on the host, then set tcp-backlog 65535 in redis.conf.

# redis.conf tuning for production
tcp-keepalive 300         # keepalive interval in seconds
timeout 0                 # don't close idle connections (let app manage)
tcp-backlog 511           # connection backlog (match net.core.somaxconn)
hz 10                     # background task frequency (increase for tighter expiry)

# Disable THP (Transparent Huge Pages) — Redis docs recommend this
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# Network tweaks for Redis
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535
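The somaxconn clamp described above is just a min(): the kernel caps every listen() backlog at net.core.somaxconn, so the effective accept queue is the smaller of the two settings. A one-line sketch:

```python
# Sketch: effective accept-queue depth is min(tcp-backlog, somaxconn).

def effective_backlog(tcp_backlog: int, somaxconn: int) -> int:
    return min(tcp_backlog, somaxconn)

print(effective_backlog(511, 128))      # → 128 (silently clamped by the kernel)
print(effective_backlog(65535, 65535))  # → 65535
```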