Interview Gauntlet: Secrets Management System

Category: System Design · Difficulty: L2-L3 · Duration: 15-20 minutes · Domains: Security, Compliance


Round 1: The Opening

Interviewer: "Design a secrets management system for an organization running 30 microservices across Kubernetes and some legacy VMs. How do you manage database passwords, API keys, and TLS certificates?"

Strong Answer:

"I'd centralize on HashiCorp Vault as the secrets backend. It gives us a single control plane for three distinct secret types: dynamic database credentials via the database secrets engine, static API keys via the KV v2 secrets engine, and TLS certificates via the PKI secrets engine. For Kubernetes services, I'd use the Vault Agent Injector — a mutating webhook that injects a sidecar into pods. The sidecar authenticates to Vault using the pod's Kubernetes ServiceAccount token, fetches secrets, and writes them to a shared volume that the app reads from. For legacy VMs, I'd use Vault Agent running as a systemd service that authenticates via AppRole or cloud provider IAM (like AWS IAM auth), fetches secrets, and writes them to a file on disk with proper permissions. Auth is the critical piece: every workload authenticates to Vault using an identity it already has — K8s SA tokens, AWS IAM roles, Azure MSI — so there are no bootstrap secrets to manage."
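The VM-side flow described above can be sketched in miniature. This is an illustrative model, not the real Vault client: the `FakeVault` class and `fetch_and_write` helper are stand-ins for the AppRole login and KV read API calls, and the role/secret IDs and paths are made up.

```python
import json
import os
import tempfile

# Illustrative stand-in for Vault's HTTP API: an AppRole login returns a
# client token, and an authenticated read returns the secret at a KV path.
class FakeVault:
    def __init__(self):
        self._secrets = {"secret/data/svc-a/api-key": {"key": "s3cr3t"}}

    def approle_login(self, role_id, secret_id):
        # Real Vault: POST /v1/auth/approle/login -> client token
        return "fake-client-token"

    def read(self, token, path):
        assert token == "fake-client-token"
        return self._secrets[path]

def fetch_and_write(vault, role_id, secret_id, path, dest):
    """Authenticate, fetch a secret, and write it with owner-only permissions."""
    token = vault.approle_login(role_id, secret_id)
    secret = vault.read(token, path)
    # 0o600: readable only by the account the service runs as.
    fd = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(secret, f)
    return dest

dest = os.path.join(tempfile.mkdtemp(), "api-key.json")
fetch_and_write(FakeVault(), "role-id", "secret-id", "secret/data/svc-a/api-key", dest)
print(oct(os.stat(dest).st_mode & 0o777))  # 0o600
```

The point of the sketch is the shape of the flow: authenticate with a pre-existing identity, fetch, write with tight file permissions, and never hard-code a secret in the service itself.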

Common Weak Answers:

  • "We'd use Kubernetes Secrets." — K8s Secrets are base64-encoded, not encrypted at rest by default, and don't solve the VM use case or rotation.
  • "Store them in environment variables." — Env vars leak into process listings, crash dumps, and child processes. This is a delivery mechanism, not a management system.
  • "Use AWS Secrets Manager." — Fine for AWS-only, but the question specified mixed infrastructure. Jumping to a single cloud provider's solution without acknowledging the constraint is a miss.
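The first weak answer is worth being able to demonstrate concretely: base64 is an encoding, not encryption, so anyone who can read the Secret object can recover the value. A quick stdlib illustration:

```python
import base64

# What `kubectl get secret -o yaml` shows is just base64 text, reversible
# by anyone who can read the object -- no key required.
encoded = base64.b64encode(b"hunter2").decode()
print(encoded)                             # aHVudGVyMg==
print(base64.b64decode(encoded).decode())  # hunter2
```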

Round 2: The Probe

Interviewer: "You said dynamic database credentials. Walk me through exactly what happens when Service A needs to connect to PostgreSQL. What's the lifecycle of that credential, from creation to revocation?"

What the interviewer is testing: Whether the candidate understands Vault's database secrets engine beyond the marketing page — specifically TTLs, leases, and what happens when things go wrong.

Strong Answer:

"Service A's pod starts up. The Vault Agent sidecar authenticates to Vault using the pod's ServiceAccount JWT, which Vault validates against the Kubernetes API. Vault checks the policy attached to Service A's role — let's call it svc-a-db-readonly. That policy allows reading from database/creds/svc-a-readonly. When the sidecar requests that path, Vault's database secrets engine connects to PostgreSQL using its own privileged credentials and executes CREATE ROLE "v-svc-a-xxxx" WITH LOGIN PASSWORD 'random' VALID UNTIL '...'. The credential has a TTL — say 1 hour. Vault returns the username and password to the sidecar, which writes them to /vault/secrets/db-creds. The app reads that file. Before the TTL expires, the Vault Agent automatically renews the lease. If the pod dies or the lease isn't renewed, Vault revokes the credential after the max TTL — deleting the PostgreSQL role. So at any given time, each pod has its own unique credential that's automatically rotated and automatically cleaned up. If we need to revoke all credentials immediately — say, during a security incident — we revoke the lease prefix for that path, and Vault drops every dynamic credential it issued."
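The lifecycle above can be reduced to a toy model. This is a simplification for intuition, not Vault's implementation: a single ttl/max_ttl pair, a fake clock in seconds, and no policy checks on renewal.

```python
# Toy model of a Vault dynamic-credential lease: issued with a TTL,
# renewable until max_ttl, revoked (DB role deleted) once it lapses.
class Lease:
    def __init__(self, now, ttl, max_ttl):
        self.ttl = ttl
        self.expires_at = now + ttl
        self.max_expires_at = now + max_ttl
        self.revoked = False

    def renew(self, now):
        """Agent-side renewal: extend the expiry, capped at max_ttl."""
        if self.revoked or now >= self.expires_at:
            return False
        self.expires_at = min(now + self.ttl, self.max_expires_at)
        return True

    def tick(self, now):
        """Vault-side sweep: revoke once the lease has lapsed."""
        if not self.revoked and now >= self.expires_at:
            self.revoked = True

# 1-hour TTL, 24-hour max TTL, times in seconds.
lease = Lease(now=0, ttl=3600, max_ttl=86400)
assert lease.renew(now=3000)   # renewed before expiry: now valid until t=6600
lease.tick(now=3600)
assert not lease.revoked       # renewal carried it past the original expiry
lease.tick(now=7000)           # pod died, no further renewals: lease lapses
assert lease.revoked
```

The max_ttl cap is what guarantees cleanup even for a healthy pod that renews forever: at max_ttl the lease dies regardless, and the app must fetch a fresh credential.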

Trap Alert:

If the candidate bluffs here: The interviewer will ask "What happens if Vault can't reach PostgreSQL when it tries to revoke the credential?" This is a real operational edge case. The honest answer is: "Vault will retry revocation, but if PostgreSQL is unreachable, the credential persists until PostgreSQL's own VALID UNTIL expiry. This is why the TTL matters — it's the worst-case window of exposure. I'd set TTLs short enough that even failed revocations are bounded."


Round 3: The Constraint

Interviewer: "The company just acquired a business unit running in Azure. Another team runs on-prem. Now you need to manage secrets across AWS, Azure, and on-prem — and the compliance team needs a full audit trail of every secret access. How does your design change?"

Strong Answer:

"Vault is actually well-suited for this because it separates the auth backends from the secrets engines. For Azure workloads, I'd add the Azure auth method — VMs authenticate using their Managed Service Identity, AKS pods use workload identity federation. The on-prem systems use AppRole auth with wrapped tokens distributed by a configuration management tool like Ansible. Each environment gets its own Vault namespace or a separate Vault cluster peered together, depending on latency and compliance boundaries. For the audit trail, Vault has built-in audit devices — I'd enable the file audit device writing to a persistent volume, and a socket audit device shipping to our SIEM (Splunk, Elasticsearch, or whatever the compliance team uses). Every single Vault API call — every secret read, every authentication, every policy change — is logged with the accessor identity, timestamp, and the path accessed. The response body is HMAC'd by default so the actual secret values aren't in the audit log, but you can prove who accessed what when. One critical detail: Vault will not serve any requests if all audit devices are blocked. So if the audit log volume fills up, Vault stops working entirely — by design, to prevent unaudited access. That means the audit pipeline itself needs monitoring and capacity planning."
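The HMAC behavior in the audit log can be illustrated with the stdlib. The salt value below is made up; the real mechanism is Vault's per-audit-device salt with HMAC-SHA256, and the key property is that you verify a suspected value by recomputing its HMAC, never by decrypting the log.

```python
import hashlib
import hmac

# Per-audit-device salt (illustrative value). Vault HMACs secret values in
# audit entries so the log proves *what* was accessed without storing it.
SALT = b"audit-device-salt"

def audit_hmac(value: str) -> str:
    return "hmac-sha256:" + hmac.new(SALT, value.encode(), hashlib.sha256).hexdigest()

logged = audit_hmac("db-password-123")
print(logged[:12])  # hmac-sha256:

# Forensics: the HMAC can't be reversed, but a candidate value can be tested.
assert audit_hmac("db-password-123") == logged
assert audit_hmac("wrong-guess") != logged
```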

The Senior Signal:

What separates a senior answer: Knowing that Vault refuses to operate if audit logging fails. This is a design choice most people discover in production the hard way — the audit log disk fills up, and suddenly no service can fetch secrets. Mentioning this proactively shows real operational experience. Also: understanding that Vault namespaces provide multi-tenancy within a single cluster, which is cheaper and simpler than running separate Vault clusters per environment.


Round 4: The Curveball

Interviewer: "Your Vault cluster goes down. All of it — both nodes, the storage backend, everything. What happens to your 30 running microservices?"

Strong Answer:

"In the short term, nothing breaks immediately. Services that already have their secrets cached — either in the Vault Agent's in-memory cache or written to the tmpfs volume — keep running with their current credentials. The Vault Agent sidecar will retry connections to Vault and log errors, but it won't crash the application. The problem starts when credentials need to rotate. Dynamic database credentials with a 1-hour TTL will expire and not get renewed. The Vault Agent can be configured with exit_after_auth = false and will keep retrying, but the app will eventually fail when it tries to use an expired credential. For TLS certificates managed by cert-manager with Vault as the issuer, any cert renewals during the outage will fail. So the impact is time-bounded: services are fine for the duration of their current credential TTL, then start failing. For disaster recovery, Vault supports integrated Raft storage with snapshots — I'd have automated Raft snapshots to S3 every 15 minutes. Recovery means: spin up new Vault nodes, restore from snapshot, unseal with the recovery keys (or auto-unseal via AWS KMS). The recovery time target is under 30 minutes if we've practiced the runbook. To reduce the blast radius, I'd also ensure services have a fallback: the Vault Agent can cache the last-known-good secrets on a persistent volume, and the app should handle credential refresh failures gracefully — retry with the existing credential before crashing."
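The "time-bounded impact" claim can be made concrete: each service fails only when its own current credential lapses, so the fleet-wide grace period before the first failure is the minimum remaining TTL. The service names and numbers below are illustrative.

```python
# Remaining lease time (minutes) per service at the moment Vault goes down.
remaining_ttl_min = {
    "svc-a-db": 42,      # dynamic Postgres creds, renewed 18 min ago
    "svc-b-db": 7,       # renewal was due imminently
    "ingress-tls": 720,  # cert renewed nightly, 12h of validity left
}

# No service breaks until its credential lapses; the grace period before
# the *first* failure is the minimum remaining TTL across the fleet.
first_failure_min = min(remaining_ttl_min.values())
failure_order = sorted(remaining_ttl_min, key=remaining_ttl_min.get)
print(first_failure_min)  # 7
print(failure_order[0])   # svc-b-db
```

This is also the trade-off hiding inside "set TTLs short": shorter TTLs shrink the exposure window from Round 2 but shrink this outage grace period too.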

Trap Question Variant:

The right answer is nuanced. Candidates who say "everything keeps working because secrets are cached" are half right. Candidates who say "everything dies immediately" are wrong. The senior answer is: "It depends on the TTLs. We have a grace period equal to the shortest credential TTL, and after that, services start degrading. The actual blast radius depends on how we've configured our TTLs and caching."


Round 5: The Synthesis

Interviewer: "An engineering director pushes back: 'Vault is too complex. Why can't we just use AWS Secrets Manager and Azure Key Vault directly?' Make the case for or against your centralized approach."

Strong Answer:

"Honestly, that's a legitimate option and might even be the right call depending on the organization. Here's the trade-off: using native cloud secret stores (AWS Secrets Manager, Azure Key Vault) is simpler per-environment and has zero operational overhead — the cloud provider runs the infrastructure. The downsides are fragmented access control (IAM policies in AWS and Azure are different languages), no single audit trail (you'd need to aggregate CloudTrail and Azure Activity Log), and the on-prem gap (neither cloud service helps the on-prem workloads). Vault gives you one policy language, one audit log, and one API across all environments — but you own the infrastructure, the upgrades, the unsealing, the HA configuration. My recommendation depends on the team: if we have a platform team that can operate Vault reliably, the centralized model saves cross-cutting compliance effort. If we don't have that team, using native cloud stores per environment with a compliance aggregation layer on top is pragmatically better than running Vault poorly. The worst outcome is deploying Vault, not investing in operating it, and having it become the single point of failure for every service in the company."

What This Sequence Tested:

  • Round 1: Breadth of secrets management architecture
  • Round 2: Deep understanding of dynamic credential lifecycle
  • Round 3: Multi-cloud design and compliance audit requirements
  • Round 4: Failure mode analysis and disaster recovery thinking
  • Round 5: Pragmatic trade-off communication and organizational awareness

Prerequisite Topic Packs