# Interview Gauntlet: Customer Reports Data Inconsistency

**Category:** Incident Response · **Difficulty:** L2–L3 · **Duration:** 15–20 minutes · **Domains:** Caching, Consistency
## Round 1: The Opening
Interviewer: "A customer opens a support ticket: 'I updated my profile 2 hours ago but the old information is still showing.' Support confirms the update exists in the database. What do you investigate?"
### Strong Answer
"Classic read-after-write inconsistency. The customer wrote data to one place and is reading it from another. The usual architecture for a profile page is: write goes to the primary database, read might come from a cache (Redis, Memcached), a read replica, or a CDN if the profile is rendered server-side and cached. I'd trace the read path: when the profile page loads, where does the data come from? Check the API response headers for cache indicators — `X-Cache: HIT`, `Age: 7200`, or custom headers like `X-Data-Source: cache`. If the API is reading from Redis, I'd compare the cached value with the database value: `redis-cli GET user:12345:profile` vs `SELECT * FROM users WHERE id = 12345`. If the cache has stale data, the cache invalidation on write didn't fire or didn't propagate. If the data matches in both the database and cache, the issue might be a CDN caching the HTML page, or a browser cache if the response has aggressive `Cache-Control` headers."
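The failure mode described above can be reproduced with a toy model. This is a minimal sketch, assuming plain dicts stand in for the primary database and a Redis-style cache, and the key name is illustrative:

```python
# Toy reproduction of read-after-write inconsistency. Dicts stand in for
# the primary database and the cache; the key format is hypothetical.
database = {"user:12345:profile": {"name": "Old Name"}}
cache = {}

def read_profile(key):
    # Cache-aside read: serve from cache on a hit, else fall through
    # to the database and populate the cache.
    if key in cache:
        return cache[key]
    value = database[key]
    cache[key] = value
    return value

def write_profile_without_invalidation(key, value):
    # Buggy write path: updates the database but never touches the cache.
    database[key] = value

key = "user:12345:profile"
read_profile(key)                                    # populates the cache
write_profile_without_invalidation(key, {"name": "New Name"})

# The database now has the new value, but every subsequent read is a
# cache hit on the stale entry -- exactly what the customer is seeing.
```

Comparing `database[key]` with `read_profile(key)` after the write shows the divergence the debugging steps above are designed to surface.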
### Common Weak Answers
- "It's probably a browser cache — tell them to clear it." — While possible, dismissing a customer-reported data integrity issue without investigation is a support anti-pattern and misses the systemic issue.
- "Check if the database write was committed." — The premise states the update exists in the database. Reading the problem carefully before jumping to solutions matters.
- "It'll fix itself eventually." — Even if true (eventual consistency), this doesn't address why 2 hours is too long, or whether the consistency model matches the product's promises.
## Round 2: The Probe
Interviewer: "You confirm it's the caching layer — Redis has the old profile data. The application is supposed to invalidate the cache on write. Why didn't the cache invalidation work?"
What the interviewer is testing: Understanding of cache invalidation patterns and the common failure modes of each.
### Strong Answer
"Cache invalidation can fail in several ways depending on the pattern used. If we're using cache-aside (application deletes the cache key after writing to the database), the invalidation might have failed silently — the `DEL` command to Redis could have failed due to a network blip, and if the application doesn't check the return value or retry, the stale key persists. I'd check the application logs around the time of the customer's write for Redis connection errors or timeouts. If we're using write-through (application writes to both database and cache), the cache write might have been lost if it's done asynchronously — the database write succeeded but the cache write was queued and dropped. If we're using pub/sub invalidation (a database trigger or Change Data Capture event triggers cache invalidation), the message might have been lost or the consumer might have been down. I'd check: first, is the invalidation code even executing on this code path? Profile updates might go through a different API route than the one instrumented with cache invalidation — especially if the update was done through an admin panel or a bulk import. Second, is there a race condition? If another process reads the profile and re-populates the cache between the database write and the cache invalidation, the stale data is back."
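The race condition at the end of that answer is subtle enough to deserve a concrete trace. This sketch replays the interleaving sequentially, with in-memory dicts standing in for Redis and the database (the key name is illustrative):

```python
# Sketch of the cache-aside race: a slow concurrent reader re-populates
# the cache with a value it fetched *before* the write committed.
database = {"user:42": "old"}
cache = {}

# Step 1: a reader starts a cache-miss read and fetches from the database.
value_read_by_reader = database["user:42"]          # reads "old"

# Step 2: a writer updates the database, then invalidates the cache key
# (the equivalent of `DEL user:42` in Redis). Invalidation works perfectly.
database["user:42"] = "new"
cache.pop("user:42", None)

# Step 3: the slow reader finally populates the cache with its stale copy.
cache["user:42"] = value_read_by_reader

# The cache now holds "old" while the database holds "new", and nothing
# will fix it until the key is invalidated again or its TTL expires.
```

Note that every individual step behaved correctly; only the interleaving is wrong, which is why this class of bug survives unit tests that exercise each path in isolation.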
### Trap Alert
If the candidate bluffs here: The interviewer will ask "Describe the difference between cache-aside, write-through, and write-behind patterns." These are fundamental caching patterns. Cache-aside: application manages cache directly (read: check cache, miss -> read DB, populate cache; write: write DB, invalidate cache). Write-through: application writes to cache and cache writes to DB synchronously. Write-behind: application writes to cache, cache asynchronously writes to DB. Mixing up write-through and write-behind is a common mistake.
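The three patterns the trap question targets can be sketched side by side. Dicts stand in for the cache and database, and the write-behind queue is just a list; this illustrates the shape of each pattern, not production code:

```python
# Minimal side-by-side sketch of the three write patterns.
db, cache, write_behind_queue = {}, {}, []

def cache_aside_write(key, value):
    # Cache-aside: the application writes the DB, then invalidates the
    # cache entry. The next read misses and re-populates from the DB.
    db[key] = value
    cache.pop(key, None)

def write_through(key, value):
    # Write-through: cache and DB are updated synchronously, so both
    # are consistent when the call returns.
    cache[key] = value
    db[key] = value

def write_behind(key, value):
    # Write-behind: the cache is updated now; the DB write is deferred
    # to a queue and flushed later (the durability risk lives here).
    cache[key] = value
    write_behind_queue.append((key, value))

def flush_write_behind():
    # Drains the deferred writes to the database.
    while write_behind_queue:
        key, value = write_behind_queue.pop(0)
        db[key] = value
```

The key distinction the interviewer is probing: after `write_through` returns, the DB has the value; after `write_behind` returns, it does not until the flush runs.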
## Round 3: The Constraint
Interviewer: "You discover the cache invalidation is working for normal profile updates, but this customer used a new bulk import feature that bypasses the normal API. The bulk import writes directly to the database using a batch SQL statement. The cache layer has no idea the data changed. How do you fix this without rewriting the bulk import?"
### Strong Answer
"If the bulk import writes directly to the database and bypasses the application layer where cache invalidation lives, I need a mechanism that watches for database changes and invalidates the cache independently. Several options. First, Change Data Capture (CDC): tools like Debezium for PostgreSQL or MySQL can capture row-level changes from the WAL or binlog and emit events to Kafka. A small consumer service reads those events and issues Redis `DEL` commands for the affected cache keys. This works for any code path that modifies the database, not just the application API. Second, database triggers: a PostgreSQL `NOTIFY` on the users table that a lightweight listener process picks up and translates to cache invalidations. This is simpler than CDC but more tightly coupled to the database. Third, a more pragmatic approach: add a cache invalidation step at the end of the bulk import job. After the batch insert commits, the job iterates over the affected user IDs and deletes their cache keys. This is the simplest fix — no new infrastructure — but it only works for the bulk import, not for future bypasses. I'd implement option three immediately to fix the bug, and option one (CDC) as a longer-term solution that catches all database writes regardless of origin."
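The pragmatic fix (option three) is small enough to sketch. The key format and the `delete_key` callable are assumptions standing in for the real codebase and a Redis client; in practice the deletes would be batched through a pipeline:

```python
# Sketch of option three: after the bulk import commits, delete the
# profile cache key for every affected user. `delete_key` is a stand-in
# for a Redis DEL call; the key format is hypothetical.
def invalidate_after_bulk_import(delete_key, affected_user_ids):
    """Run after the batch insert commits, never before."""
    for user_id in affected_user_ids:
        delete_key(f"user:{user_id}:profile")

# Usage with an in-memory stand-in for Redis:
cache = {"user:1:profile": "stale", "user:2:profile": "stale"}
invalidate_after_bulk_import(lambda k: cache.pop(k, None), [1, 2])
# Both stale entries are gone; the next read repopulates from the DB.
```

Ordering matters here: invalidating before the batch commit re-opens the race from Round 2, because a read between the invalidation and the commit would repopulate the cache with pre-import data.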
### The Senior Signal
What separates a senior answer: Recognizing that the root cause isn't a bug in cache invalidation — it's an architectural gap where writes can enter the database through a path that the cache doesn't know about. The CDC approach solves the category of problem, not just this instance. Also: being pragmatic about the fix timeline — ship the quick fix now, build the robust solution later.
## Round 4: The Curveball
Interviewer: "A developer suggests: 'Just set a low TTL on the cache — 5 minutes. Then even if invalidation misses, the data is at most 5 minutes stale.' Is this a good solution?"
### Strong Answer
"It's a reasonable trade-off for some use cases but not a complete solution, and there are hidden costs. The benefit: a 5-minute TTL provides a bounded staleness window, so even if all invalidation fails, the worst case is 5 minutes of stale data. For a user profile page, that might be acceptable. The costs: first, cache hit rate drops significantly. With a 24-hour TTL, a popular profile is cached all day and served from Redis. With a 5-minute TTL, it expires and gets re-fetched from the database up to 288 times per day. For a high-traffic service, this dramatically increases database read load. Second, it creates 'thundering herd' problems: when a popular cache key expires, many concurrent requests simultaneously miss the cache and hit the database. Third, it masks the invalidation bug rather than fixing it — the 5-minute staleness is still there for the bulk import path. For this specific case, I'd argue: fix the invalidation bug properly (it's a correctness issue), and use TTL as a safety net. Set the TTL to something reasonable like 1 hour — short enough to bound the worst case, long enough to maintain a good cache hit rate. The TTL should be the last line of defense, not the primary consistency mechanism."
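One common mitigation for the thundering-herd cost mentioned above is to add jitter to the TTL so popular keys set at the same time don't all expire at the same instant. A minimal sketch; the base TTL and jitter fraction are illustrative values, not recommendations:

```python
import random

def ttl_with_jitter(base_ttl_seconds, jitter_fraction=0.1):
    """Return a TTL randomized within +/- jitter_fraction of the base,
    spreading expirations out instead of clustering them."""
    spread = base_ttl_seconds * jitter_fraction
    return base_ttl_seconds + random.uniform(-spread, spread)

# A 1-hour base TTL lands somewhere between 3240 and 3960 seconds,
# so keys cached together expire staggered rather than all at once.
ttl = ttl_with_jitter(3600)
```

Jitter spreads out the misses but does not eliminate them for a single hot key; for that, techniques like request coalescing (only one request refills the cache while others wait) are the usual complement.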
### Trap Question Variant
The right answer requires nuance. Candidates who say "yes, just lower the TTL" are missing the performance implications. Candidates who say "no, TTLs are a crutch" are being purist. The senior answer is: "TTL is a valid safety net but not a substitute for correct invalidation. The TTL value should be chosen based on the acceptable staleness window for the use case AND the database's ability to handle the increased read load when cache entries expire."
## Round 5: The Synthesis
Interviewer: "Caching is supposed to improve performance, but in this case it caused a data integrity issue that became a customer support ticket. How do you think about the decision of what to cache and what not to cache?"
### Strong Answer
"Not everything should be cached, and the decision framework should consider three dimensions: the cost of staleness, the frequency of change, and the read-to-write ratio. For data where staleness causes customer-visible confusion — like user profiles, account balances, or order status — the cache strategy must guarantee consistency or have extremely short staleness bounds. For data where staleness is tolerable — like product catalog pages, recommendation results, or aggregate dashboards — longer TTLs and looser invalidation are fine. The read-to-write ratio matters too: a product page that's read 10,000 times per write is a perfect caching candidate. A real-time chat message that's read once per write is a terrible one. In practice, I'd establish a caching policy: all new caches must specify their consistency guarantee (strong, bounded staleness, eventual), the TTL rationale, and the invalidation mechanism. The policy should explicitly call out 'user-mutable data' as requiring tested invalidation — not just 'it should work' but an integration test that writes data and verifies the cache is updated. The customer support ticket isn't just a bug — it's a signal that the caching design didn't account for all write paths, and the testing didn't verify the user experience end-to-end."
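The decision framework in that answer could be encoded as a small helper, which is one way to make a caching policy concrete in code review. Everything here is a hypothetical illustration: the function name, thresholds, and recommendation strings are assumptions, not part of any real policy:

```python
# Hypothetical encoding of the decision framework above: cost of
# staleness and read-to-write ratio drive the caching recommendation.
# Thresholds are illustrative, not prescriptive.
def caching_recommendation(staleness_is_customer_visible, reads_per_write):
    if staleness_is_customer_visible:
        # Profiles, balances, order status: consistency comes first.
        return "cache only with tested invalidation or very short TTL"
    if reads_per_write >= 100:
        # Catalog pages, dashboards: classic caching sweet spot.
        return "cache with a long TTL; loose invalidation is fine"
    if reads_per_write <= 1:
        # Read-once data, e.g. a chat message: caching buys nothing.
        return "do not cache"
    return "cache with a moderate TTL sized to acceptable staleness"
```

Calling `caching_recommendation(True, 10_000)` returns the strict option even though the read ratio is excellent, which is the point of the framework: a high read-to-write ratio never overrides a customer-visible staleness cost.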
## What This Sequence Tested
| Round | Skill Tested |
|---|---|
| 1 | Systematic approach to read-after-write inconsistency |
| 2 | Cache invalidation pattern knowledge and failure mode analysis |
| 3 | Architectural thinking about database bypass paths (CDC vs tactical fix) |
| 4 | Critical evaluation of proposed solutions with trade-off awareness |
| 5 | Caching strategy and policy design at organizational level |