Redis Footguns¶
Mistakes that cause production outages, data loss, or silent cache poisoning.
1. Using KEYS * in Production¶
An engineer or application calls KEYS * (or KEYS session:*) to list keys. Redis is single-threaded for command execution. On a Redis instance with 10 million keys, KEYS blocks for several seconds. Every other client times out. Your application reports a complete outage for the duration.
Fix: Replace every KEYS usage with SCAN. SCAN is cursor-based, returns a small batch per call, and is non-blocking. Use redis-cli --scan --pattern "prefix:*" on the CLI. Disable KEYS entirely in non-development environments: rename-command KEYS "" in redis.conf.
Under the hood: Redis processes commands on a single thread (the event loop). KEYS must iterate every key in the keyspace before returning: O(n) where n is total keys, not matching keys. On a 10M-key instance, this takes 1-5 seconds. During that time, every other client's command (including health checks) is queued. Load balancers see the health check time out and mark the node as dead.
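The behavioral difference can be sketched in plain Python with a toy keyspace (no Redis involved; fnmatch stands in for Redis glob matching, and the offset cursor is a simplification of Redis's real bucket-position cursors):

```python
import fnmatch

def keys(store, pattern):
    """Toy KEYS: one O(n) pass over the whole keyspace in a single call."""
    return [k for k in store if fnmatch.fnmatch(k, pattern)]

def scan(store, cursor, pattern, count=100):
    """Toy SCAN: at most `count` keys examined per call, plus a new cursor.
    Real Redis cursors are hash-table bucket positions, not offsets."""
    all_keys = sorted(store)  # stable order so the toy cursor is meaningful
    batch = all_keys[cursor:cursor + count]
    next_cursor = cursor + count if cursor + count < len(all_keys) else 0
    return next_cursor, [k for k in batch if fnmatch.fnmatch(k, pattern)]

store = {f"session:{i}": "x" for i in range(1000)}
store.update({f"user:{i}": "y" for i in range(500)})

# SCAN-style iteration: many small calls, the event loop is free in between
cursor, found = 0, []
while True:
    cursor, batch = scan(store, cursor, "session:*")
    found.extend(batch)
    if cursor == 0:
        break
assert sorted(found) == sorted(keys(store, "session:*"))
```

Each `scan` call touches at most 100 keys, so a real server interleaves other clients' commands between calls, while `keys` holds the event loop for the full pass.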
2. maxmemory-policy noeviction on a Cache¶
You deploy Redis as an application cache with maxmemory-policy noeviction (the default). When Redis fills up, write operations return OOM command not allowed when used memory > 'maxmemory'. Applications that expect a cache to silently drop stale data instead crash with unexpected errors.
Fix: For caches, use allkeys-lru or allkeys-lfu. For session stores or queues where data loss is unacceptable, use noeviction, but make maxmemory large enough and alert at 80% usage. Never put a session store and a cache in the same Redis instance: an instance has exactly one eviction policy, and no single policy is safe for both workloads.
Default trap: Redis ships with maxmemory 0 (no limit) and maxmemory-policy noeviction. This means Redis will grow until the OS OOM-kills it. For production, you must set both: maxmemory to a specific byte value and maxmemory-policy to match your use case. The AWS ElastiCache default is volatile-lru, which only evicts keys that have a TTL set; keys without a TTL are never evicted.
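For the cache case, the two directives together might look like this in redis.conf (2gb is a placeholder; size it to your workload):

```conf
# cache use case: cap memory and evict least-recently-used keys when full
maxmemory 2gb
maxmemory-policy allkeys-lru
```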
3. No Persistence + Container Restart = Total Data Loss¶
You run Redis in a container with no persistent volume and no RDB/AOF configuration. A container restart (deployment, OOM kill, node drain) wipes all data. This is fine for a pure cache, but teams often start using Redis for rate limiting, session state, distributed locks, and queues — all of which need durability.
Fix: Decide at deployment time whether Redis data is ephemeral or durable. For ephemeral: document this clearly and ensure the application handles cold-cache startup correctly. For durable: mount a persistent volume, enable AOF (appendonly yes), and test restore from a container restart. Never let the default (no persistence) be an accidental choice.
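For the durable case, a minimal sketch of the relevant redis.conf lines (/data is an illustrative mount point for the persistent volume):

```conf
# durable Redis: AOF on, fsync once per second, data dir on a mounted volume
appendonly yes
appendfsync everysec
dir /data
```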
4. Replication Backlog Overflow Forcing Full Resyncs¶
A replica falls behind (network issue, slow disk, high write load on the master). The replication backlog is a fixed-size ring buffer; once the replica's offset is no longer covered by it, a partial resync is impossible and the replica must do a full resync: the master forks and serializes the entire dataset. During the full resync, copy-on-write can push the master's memory usage toward double under heavy writes. If the host has less than 50% free memory, the fork can fail with OOM and bring Redis down.
Fix: Set repl-backlog-size to a reasonable value (e.g., 256mb) to give replicas a wider resync window. Monitor master_repl_offset - slave_repl_offset and alert on growing gaps. Ensure each Redis server host has at least 50% free memory to accommodate fork overhead. On Linux, set vm.overcommit_memory = 1.
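A sketch of the relevant settings (the values are illustrative starting points, not universal recommendations):

```conf
# redis.conf on the master: a larger backlog widens the partial-resync window
repl-backlog-size 256mb
# keep the backlog around for an hour after the last replica disconnects
repl-backlog-ttl 3600
```

```conf
# /etc/sysctl.d/99-redis.conf: allow fork() to succeed without the
# kernel demanding memory for a full copy of the process up front
vm.overcommit_memory = 1
```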
5. FLUSHDB / FLUSHALL in the Wrong Environment¶
You're debugging a staging issue. You connect with redis-cli and run FLUSHDB to clear the cache. You are connected to production. All keys in the database are gone instantly. There is no confirmation prompt and no undo.
Fix: Use different authentication passwords per environment. Rename dangerous commands in production: rename-command FLUSHDB "" and rename-command FLUSHALL "" in redis.conf. Connect via environment-specific DNS names (e.g. prod-redis-01) so the redis-cli prompt itself tells you which environment you are in. Always double-check your connection string before running destructive commands.
War story: A widely shared 2017 incident: an engineer ran FLUSHALL against production Redis, wiping session data for millions of users during peak traffic. The root cause was identical connection strings for staging and production, differing only in the hostname, which was copy-pasted wrong. Separate passwords per environment would have turned the mistake into an auth error instead of data loss.
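The hardening above as a redis.conf sketch (the password is a placeholder; in practice pull it from a secret manager, distinct per environment):

```conf
# distinct password per environment; placeholder value shown
requirepass CHANGE-ME-prod-only-secret
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command KEYS ""
```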
6. Storing Large Values Without TTL Causes Gradual Memory Leak¶
You cache rendered HTML pages or large JSON blobs in Redis without TTLs. New keys are written as content changes, but old keys are never expired. Over weeks, Redis fills with thousands of stale large keys. Memory hits the limit, eviction kicks in, and you start evicting frequently-accessed hot keys to make room for stale garbage.
Fix: Always set a TTL on every key you write to Redis, even a long one (7 days, 30 days). Use redis-cli --bigkeys periodically to find large keys. To find keys without a TTL, scan and check each key's TTL (raw-mode output is -1 when no expiry is set): redis-cli --scan | while read key; do [ "$(redis-cli TTL "$key")" = "-1" ] && echo "$key"; done. Design cache key schemes that include a version or content hash so stale keys naturally stop being requested even if they linger.
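The same audit can be sketched with redis-py-style calls. scan_iter and ttl match the real redis-py client's method names; FakeRedis is a tiny stand-in so the sketch runs without a server, and in production you would pass a redis.Redis(...) instance instead:

```python
import fnmatch

def keys_without_ttl(client, pattern="*"):
    """Yield keys that have no expiry (Redis TTL == -1)."""
    for key in client.scan_iter(match=pattern):
        if client.ttl(key) == -1:
            yield key

class FakeRedis:
    """In-memory stand-in exposing only the two calls used above."""
    def __init__(self, ttls):
        self._ttls = ttls  # key -> ttl in seconds, or -1 for "no expiry"
    def scan_iter(self, match="*"):
        return (k for k in self._ttls if fnmatch.fnmatch(k, match))
    def ttl(self, key):
        return self._ttls.get(key, -2)  # -2 means "missing", like Redis

client = FakeRedis({"page:/home": -1, "page:/about": 86400, "user:1": -1})
leaks = sorted(keys_without_ttl(client, "page:*"))
assert leaks == ["page:/home"]
```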
7. Using Redis Cluster with Multi-Key Commands¶
You migrate from a standalone Redis to Redis Cluster. Commands that touch multiple keys (MGET, MSET, SUNION, ZUNIONSTORE, Lua scripts with multiple keys) start failing with CROSSSLOT Keys in request don't hash to the same slot. Redis Cluster requires all keys in a multi-key command to be in the same slot.
Fix: Use hash tags to force related keys onto the same slot: {user:123}:profile and {user:123}:sessions both hash on user:123. Plan your key naming scheme before migrating to Cluster. For commands you can't easily refactor, execute them in application code across multiple single-key commands, or use a Redis module like RedisJSON that handles this internally.
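Why hash tags work can be shown by reimplementing the slot calculation from the cluster spec: CRC16 (XModem variant) modulo 16384, hashing only the contents of the first non-empty {...} tag if one exists:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    """Hash only the text inside the first non-empty {...} hash tag,
    if present; otherwise hash the whole key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# Hash-tagged keys land on the same slot, so MGET across them is legal.
assert key_hash_slot("{user:123}:profile") == key_hash_slot("{user:123}:sessions")
assert crc16(b"123456789") == 0x31C3  # standard XModem check value
```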
8. Sentinel Not Actually Achieving Quorum¶
You deploy Redis Sentinel with 2 sentinel processes for high availability. The master's host dies, taking one sentinel with it (or a network partition hides it). No failover happens: promoting a replica requires a majority of all configured sentinels (2 of 2 here) to elect a leader, and a single surviving sentinel is not a majority. You think you have HA, but you have a single point of failure.
Fix: Always deploy an odd number of sentinels: 3 is the minimum for a working quorum. The quorum value should be (n/2) + 1 where n is the number of sentinels. A 2-sentinel deployment achieves nothing — if the master fails and one sentinel is unreachable, you still can't failover. Test failover regularly: redis-cli -p 26379 sentinel failover mymaster.
Debug clue: redis-cli -p 26379 sentinel master mymaster shows the current master and its status. sentinel slaves mymaster lists replicas (sentinel replicas mymaster on newer versions). If num-other-sentinels is less than expected, the sentinels can't see each other; check network connectivity between sentinel instances on port 26379. sentinel ckquorum mymaster explicitly tests whether quorum is achievable.
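The arithmetic can be captured in a deliberately simplified model (this collapses Sentinel's ODOWN voting and leader election into one check and ignores timing entirely):

```python
def failover_possible(total_sentinels: int, reachable: int, quorum: int) -> bool:
    """Simplified model: marking the master down needs `quorum` votes,
    but electing a failover leader always needs a majority of ALL
    configured sentinels, regardless of the quorum setting."""
    majority = total_sentinels // 2 + 1
    return reachable >= quorum and reachable >= majority

# Two sentinels: losing one (e.g. it ran on the dead master's host) blocks failover.
assert failover_possible(2, 1, 1) is False  # majority of 2 is 2
assert failover_possible(3, 2, 2) is True   # majority of 3 is 2
```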
9. Transparent Huge Pages Causing Latency Spikes¶
Redis is running fine but you see periodic latency spikes every few minutes. redis-cli --latency-history shows the spikes clearly. The root cause is Linux's Transparent Huge Pages (THP): when Redis forks for RDB/AOF rewrite, the kernel must split huge pages on copy-on-write, causing stalls.
Fix: Disable THP on all Redis hosts:
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Make it persistent in /etc/rc.local or a systemd unit. Redis itself will warn about this in its logs: WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This warning is real — fix it.
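One way to make the setting persistent is a oneshot systemd unit along these lines (the unit name is illustrative, and Before= must match your distro's actual Redis unit name):

```ini
# /etc/systemd/system/disable-thp.service (illustrative name)
[Unit]
Description=Disable Transparent Huge Pages before Redis starts
Before=redis.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=multi-user.target
```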
10. Treating INCR as a Reliable Distributed Counter Without Understanding TTL Interaction¶
You use INCR mykey to count events and EXPIRE mykey 3600 to reset it hourly. You set these as separate commands. A race condition exists: if the key expires between the INCR and the EXPIRE, the next INCR creates a new key with no expiry set, and the counter accumulates forever. You discover this months later when one key has 50 million counts.
Fix: Use SET mykey 0 EX 3600 NX to initialize atomically (set only if not exists). Or use Lua scripts or MULTI/EXEC transactions to ensure INCR + EXPIRE are atomic. Better: use INCR and then EXPIREAT to set an absolute expiry at the hour boundary, not a relative expiry that resets the window on each call.
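The fixed pattern can be demonstrated against a tiny in-memory stand-in (TinyStore is not Redis; time is passed explicitly so the expiry behavior is visible in the assertions):

```python
class TinyStore:
    """Minimal stand-in for Redis string keys with expiry."""
    def __init__(self):
        self.data = {}  # key -> (value, expires_at or None)

    def _live(self, key, now):
        entry = self.data.get(key)
        if entry is None:
            return None
        if entry[1] is not None and now >= entry[1]:
            del self.data[key]  # lazily expire, like Redis
            return None
        return entry

    def set_nx_ex(self, key, value, ex, now):
        """SET key value NX EX ex: create with a TTL only if absent."""
        if self._live(key, now) is None:
            self.data[key] = (value, now + ex)
            return True
        return False

    def incr(self, key, now):
        entry = self._live(key, now)
        if entry is None:
            self.data[key] = (1, None)  # the footgun: fresh key, NO expiry
            return 1
        self.data[key] = (entry[0] + 1, entry[1])
        return entry[0] + 1

def safe_incr(store, key, window, now):
    """Create the counter atomically with its TTL, then increment.
    The bare-INCR path above never runs first, so no immortal key."""
    store.set_nx_ex(key, 0, window, now)
    return store.incr(key, now)

store = TinyStore()
assert safe_incr(store, "hits", 3600, now=0) == 1
assert safe_incr(store, "hits", 3600, now=10) == 2
assert safe_incr(store, "hits", 3600, now=3601) == 1  # window rolled over
```

Because set_nx_ex runs before every increment, the key is always recreated with a TTL when the previous window has expired, which is exactly what the bare INCR-then-EXPIRE sequence fails to guarantee.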