
On-Call Survival: Security

Print this. Pin it. Read it at 3 AM.

When in doubt: contain first, investigate second, explain third.


Alert: Compromised Credentials / Leaked Secret

Severity: P1

First command:

# Identify scope: which secret, where it was used
git log --all --oneline | head -20    # Was it committed to git?
# Check secret scanning alerts in GitHub Security tab
gh api repos/<org>/<repo>/secret-scanning/alerts --jq '.[].secret_type'
What you're looking for: What type of credential (API key, DB password, token), where it was exposed (git history, logs, public repo), and what it grants access to.

Decision tree:

Was the secret committed to git?
├── Yes → ROTATE IMMEDIATELY (even if the repo is private; assume it was indexed).
│         Revoke the old credential in the issuing system (AWS console, GitHub settings, etc.).
│         Issue a new credential. Update the secret in K8s/CI.
│         Purge from git history: escalate to a git admin (requires a forced history rewrite).
│         Log the incident: what was exposed, when, and the rotation timestamp.
└── No → Was it exposed in logs / error messages?
    ├── Yes → Rotate the credential. Truncate/delete affected log files.
    │         Check who has log access. Escalate to the security team.
    └── No → Phishing / social engineering?
              Escalate to the security team immediately. Do not investigate alone.

Escalation trigger: Secret grants production DB/cloud access; secret has been active for > 1 hour post-exposure; cannot identify exposure scope; evidence of use by unauthorized party.

Safe actions: Identify scope, check secret scanning alerts — read-only before escalation.

Dangerous actions: Rotating credentials (brief service disruption), purging git history (destructive, requires coordination).


Alert: Unauthorized Access / Suspicious Activity

Severity: P1

First command:

# Kubernetes: who has been accessing the API server
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Cloud: recent API calls (AWS CloudTrail / GCP Audit Logs)
# Check: logins from unexpected IPs, unusual resource creation/deletion
What you're looking for: Actions from unexpected users, IPs, service accounts, or unusual times (off-hours spikes).

Decision tree:

Is there an active session / connection still open?
├── Yes → Contain immediately:
│         Kubernetes: kubectl delete rolebinding/clusterrolebinding <suspicious-binding>
│         Cloud: revoke the IAM user access key or assume-role session
│         SSH: pkill -u <user> or block at the firewall level
│         THEN: collect evidence before further cleanup (screenshots, logs)
└── No (historical activity, no active session) →
    ├── Assess blast radius: what did the intruder access/create/delete?
    ├── Preserve logs: copy audit logs before they rotate
    │   kubectl get events -A > /tmp/k8s-events-$(date +%Y%m%d).txt
    └── Escalate to the security team with: actor, actions, timeline, resources affected.

Escalation trigger: ANY unauthorized access to production systems. Do not try to resolve alone — escalate immediately.

Safe actions: Read audit logs, get events, identify suspicious actors — read-only.

Dangerous actions: Revoke access (may alert intruder to containment), delete evidence, delete resources (preserve for forensics).


Alert: Critical CVE in Running Container/Package

Severity: P1 (CVSS ≥ 9, exploitable in your context) / P2 (CVSS 7–8.9)

First command:

# Check which images are running
kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u
# Check CVE scanner output (Trivy, Snyk, Grype)
trivy image <image>:<tag> --severity CRITICAL,HIGH
What you're looking for: Is the vulnerable package in a running image? Is the vulnerability exploitable (network-accessible, requires user input, etc.)?

Decision tree:

Is the vulnerability actively exploitable in your deployment?
├── No (not network-reachable, not in code path) → Log it. Schedule patch within SLA.
│   CVSS 9+: patch within 7 days. CVSS 7–8.9: patch within 30 days.
└── Yes → Is a patched base image available?
    ├── Yes → Rebuild image with patched base. Test. Deploy.
    │          Fast track through CI: treat as hotfix.
    └── No → Mitigate while waiting for patch:
             - Network policy: restrict access to affected service
             - WAF rule: block exploit patterns if known
             - Disable the vulnerable feature if possible
             Escalate to security: "CVE <id> in <image>, no patch available, mitigation applied: <describe>"
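The network-policy mitigation can be sketched as a small generator piped into kubectl: emit a NetworkPolicy that blocks all ingress to the affected workload until a patched image ships. The namespace and app label are placeholders for your environment, and enforcement requires a CNI that implements NetworkPolicy (e.g. Calico or Cilium):

```shell
# make_lockdown_policy: emit a NetworkPolicy that denies all ingress to pods
# matching app=<label> in <namespace>. policyTypes: [Ingress] with an empty
# ingress list means "no ingress allowed".
make_lockdown_policy() {
  ns="$1"; app="$2"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cve-lockdown-${app}
  namespace: ${ns}
spec:
  podSelector:
    matchLabels:
      app: ${app}
  policyTypes:
    - Ingress
  ingress: []
EOF
}

# usage: make_lockdown_policy <namespace> <app-label> | kubectl apply -f -
```

Remember to record the lockdown in the incident log and remove it once the patched image is deployed.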

Escalation trigger: CVSS ≥ 9 with active exploit in the wild; vulnerability in auth/crypto code; evidence of exploitation attempt in logs.

Safe actions: Scan images, check CVE details — read-only.

Dangerous actions: Deploying unvetted patches to production, disabling security controls to test exploitability.


Alert: Certificate Issue (Expired / Revoked)

Severity: P1 (user-facing HTTPS broken) / P2 (internal service)

First command:

echo | openssl s_client -connect <hostname>:443 2>/dev/null | openssl x509 -noout -dates -subject -issuer
What you're looking for: notAfter date, whether it's expired. Also check the issuer — is this a known/trusted CA?

Decision tree:

Is the cert expired?
├── Yes → See the Networking guide, "TLS Certificate Error" section.
└── No → Is the issuer unexpected / untrusted?
    ├── Yes → POSSIBLE MITM OR CERT SUBSTITUTION.
    │         Do NOT proceed. Escalate to the security team immediately.
    │         Preserve the cert: echo | openssl s_client -connect <host>:443 2>/dev/null | openssl x509 > /tmp/suspicious-cert.pem
    └── No → Is the cert revoked? (Check with OCSP/CRL.)
              openssl ocsp -issuer <ca-cert> -cert <cert> -url <ocsp-url> -resp_text
              If revoked: rotate immediately. Escalate to security.
              If valid but the browser warns: the intermediate chain is missing. Add the chain to the server config.

Escalation trigger: Unexpected issuer (possible MITM); revoked cert; cert for wrong domain (phishing risk); cannot issue replacement.

Safe actions: Read cert details with openssl — read-only.

Dangerous actions: Accepting an untrusted cert as safe, disabling TLS verification.


Alert: Unauthorized Kubernetes RBAC / Privilege Escalation

Severity: P1

First command:

kubectl get clusterrolebindings,rolebindings -A -o yaml | grep -E "subjects|roleRef" | head -40
# Look for: unexpected users/groups bound to cluster-admin or high-privilege roles
What you're looking for: Unexpected principals (users, service accounts) bound to cluster-admin or roles with create/delete on sensitive resources.

Decision tree:

Is an unexpected service account or user bound to cluster-admin?
├── Yes → Who added this binding?
│         kubectl get clusterrolebinding <name> -o yaml | grep -E "annotations|labels"
│         Unknown origin? → Delete the binding AND escalate to security:
│         kubectl delete clusterrolebinding <name>
└── No → Is a service account with broad permissions compromised?
    ├── Yes → Rotate the service account token:
    │         kubectl delete secret <sa-token-secret> -n <ns>
    │         (legacy token secrets are recreated automatically; on K8s ≥ 1.24,
    │         pods use short-lived projected tokens, so restart the pods instead)
    └── No → Was a new RBAC binding added recently?
              Check the API server audit log for recent create calls on
              clusterrolebindings (RBAC changes do not generate regular events).
              Unauthorized change? → Delete and escalate.
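The "unexpected binding" check can be scripted against an allowlist so the scan is repeatable. A sketch assuming jq is installed; the allowlist regex and function name are placeholders:

```shell
# unexpected_admins: read `kubectl get clusterrolebindings -o json` on stdin
# and print cluster-admin subjects whose names do NOT match the allowlist regex.
unexpected_admins() {
  allow="$1"
  jq -r --arg ok "$allow" '
    .items[]
    | select(.roleRef.name == "cluster-admin")
    | .subjects[]?
    | select(.name | test($ok) | not)
    | .name'
}

# usage:
# kubectl get clusterrolebindings -o json | unexpected_admins 'system:masters|platform-admin'
```

Any name this prints is either missing from your allowlist (update it) or a finding (escalate).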

Escalation trigger: Any suspicious cluster-admin binding of unknown origin; service account token leaked externally; cannot determine if access was used.

Safe actions: Read RBAC bindings — read-only.

Dangerous actions: Delete RBAC bindings (may break services), rotate service account tokens.


Quick Reference

Most Useful Commands

# Check for exposed secrets in git
gh api repos/<org>/<repo>/secret-scanning/alerts

# Recent K8s API events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Who is bound to cluster-admin
kubectl get clusterrolebindings -o json | jq '.items[] | select(.roleRef.name=="cluster-admin") | {name:.metadata.name, subjects:.subjects}'

# List all service accounts (tokens are not shown; inspect secrets / projected tokens separately)
kubectl get serviceaccounts -A

# Scan running image for CVEs
trivy image <image>:<tag> --severity CRITICAL,HIGH

# Check certificate validity
echo | openssl s_client -connect <host>:443 2>/dev/null | openssl x509 -noout -dates -subject

# Collect audit events (preserve before rotation)
kubectl get events -A > /tmp/k8s-events-$(date +%Y%m%d-%H%M%S).txt

# Check recent secret changes in K8s
kubectl get events -A | grep -i secret

Escalation Contacts

Situation                     Team                          Channel
Any unauthorized access       Security team (immediately)   #security-incidents
Leaked production secret      Security + on-call lead       #security-incidents
Critical CVE (CVSS ≥ 9)       Security + app team           #security-incidents
Suspicious RBAC change        Security team                 #security-incidents
Possible MITM / cert fraud    Security team (immediately)   Direct page

Safe vs Dangerous Actions

Safe (do without asking)       Dangerous (get approval)
Read audit logs                Revoke credentials / access
Scan images for CVEs           Delete RBAC bindings
Read RBAC bindings             Rotate service account tokens
Preserve logs as evidence      Purge git history
Check cert details             Network isolation / firewall changes

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]