
On-Call Survival: Security

Print this. Pin it. Read it at 3 AM.

When in doubt: contain first, investigate second, explain third.


Alert: Compromised Credentials / Leaked Secret

Severity: P1

First command:

# Identify scope: which secret, where it was used
git log --all --oneline | head -20    # Was it committed to git?
# Check secret scanning alerts in GitHub Security tab
gh api repos/<org>/<repo>/secret-scanning/alerts --jq '.[].secret_type'
What you're looking for: What type of credential (API key, DB password, token), where it was exposed (git history, logs, public repo), and what it grants access to.

Decision tree:

Was the secret committed to git?
├── Yes → ROTATE IMMEDIATELY (even if the repo is private; assume it was indexed).
│         Revoke the old credential in the issuing system (AWS console, GitHub settings, etc.).
│         Issue a new credential. Update the secret in K8s/CI.
│         Purge from git history: escalate to a git admin (requires a forced history rewrite).
│         Log the incident: what was exposed, when, and the rotation timestamp.
└── No → Was it exposed in logs / error messages?
    ├── Yes → Rotate the credential. Truncate/delete affected log files.
    │         Check who has log access. Escalate to the security team.
    └── No → Phishing / social engineering?
              Escalate to the security team immediately. Do not investigate alone.

Escalation trigger: Secret grants production DB/cloud access; secret has been active for > 1 hour post-exposure; cannot identify exposure scope; evidence of use by unauthorized party.

Safe actions: Identify scope, check secret scanning alerts — read-only before escalation.

Dangerous actions: Rotating credentials (brief service disruption), purging git history (destructive, requires coordination).


Alert: Unauthorized Access / Suspicious Activity

Severity: P1

First command:

# Kubernetes: who has been accessing the API server
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Cloud: recent API calls (AWS CloudTrail / GCP Audit Logs)
# Check: logins from unexpected IPs, unusual resource creation/deletion
What you're looking for: Actions from unexpected users, IPs, service accounts, or unusual times (off-hours spikes).

Decision tree:

Is there an active session / connection still open?
├── Yes → Contain immediately:
│         Kubernetes: kubectl delete rolebinding/clusterrolebinding <suspicious-binding>
│         Cloud: revoke the IAM user access key or assume-role session
│         SSH: pkill -u <user> or block at the firewall level
│         THEN: collect evidence before further cleanup (screenshots, logs)
└── No (historical activity, no active session) →
    ├── Assess blast radius: what did the intruder access/create/delete?
    ├── Preserve logs: copy audit logs before they rotate
    │   kubectl get events -A > /tmp/k8s-events-$(date +%Y%m%d).txt
    └── Escalate to the security team with: actor, actions, timeline, resources affected.

Escalation trigger: ANY unauthorized access to production systems. Do not try to resolve alone — escalate immediately.

Safe actions: Read audit logs, get events, identify suspicious actors — read-only.

Dangerous actions: Revoke access (may alert intruder to containment), delete evidence, delete resources (preserve for forensics).


Alert: Critical CVE in Running Container/Package

Severity: P1 (CVSS ≥ 9, exploitable in your context) / P2 (CVSS 7–8.9)

First command:

# Check which images are running
kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u
# Check CVE scanner output (Trivy, Snyk, Grype)
trivy image <image>:<tag> --severity CRITICAL,HIGH
What you're looking for: Is the vulnerable package in a running image? Is the vulnerability exploitable (network-accessible, requires user input, etc.)?

Decision tree:

Is the vulnerability actively exploitable in your deployment?
├── No (not network-reachable, not in code path) → Log it. Schedule patch within SLA.
│   CVSS 9+: patch within 7 days. CVSS 7–8.9: patch within 30 days.
└── Yes → Is a patched base image available?
    ├── Yes → Rebuild image with patched base. Test. Deploy.
    │          Fast track through CI: treat as hotfix.
    └── No → Mitigate while waiting for patch:
             - Network policy: restrict access to affected service
             - WAF rule: block exploit patterns if known
             - Disable the vulnerable feature if possible
             Escalate to security: "CVE <id> in <image>, no patch available, mitigation applied: <describe>"
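The network-policy mitigation can be sketched as a small generator piped into kubectl: emit a NetworkPolicy that blocks all ingress to the affected workload until a patched image ships. The namespace and app label are placeholders for your environment, and enforcement requires a CNI that implements NetworkPolicy (e.g. Calico or Cilium):

```shell
# make_lockdown_policy: emit a NetworkPolicy that denies all ingress to pods
# matching app=<label> in <namespace>. policyTypes: [Ingress] with an empty
# ingress list means "no ingress allowed".
make_lockdown_policy() {
  ns="$1"; app="$2"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cve-lockdown-${app}
  namespace: ${ns}
spec:
  podSelector:
    matchLabels:
      app: ${app}
  policyTypes:
    - Ingress
  ingress: []
EOF
}

# usage: make_lockdown_policy <namespace> <app-label> | kubectl apply -f -
```

Remember to record the lockdown in the incident log and remove it once the patched image is deployed.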

Escalation trigger: CVSS ≥ 9 with active exploit in the wild; vulnerability in auth/crypto code; evidence of exploitation attempt in logs.

Safe actions: Scan images, check CVE details — read-only.

Dangerous actions: Deploying unvetted patches to production, disabling security controls to test exploitability.


Alert: Certificate Issue (Expired / Revoked)

Severity: P1 (user-facing HTTPS broken) / P2 (internal service)

First command:

echo | openssl s_client -connect <hostname>:443 2>/dev/null | openssl x509 -noout -dates -subject -issuer
What you're looking for: notAfter date, whether it's expired. Also check the issuer — is this a known/trusted CA?

Decision tree:

Is the cert expired?
├── Yes → See the Networking guide, "TLS Certificate Error" section.
└── No → Is the issuer unexpected / untrusted?
    ├── Yes → POSSIBLE MITM OR CERT SUBSTITUTION.
    │         Do NOT proceed. Escalate to the security team immediately.
    │         Preserve the cert: echo | openssl s_client -connect <host>:443 2>/dev/null | openssl x509 > /tmp/suspicious-cert.pem
    └── No → Is the cert revoked? (Check with OCSP/CRL.)
              openssl ocsp -issuer <ca-cert> -cert <cert> -url <ocsp-url> -resp_text
              If revoked: rotate immediately. Escalate to security.
              If valid but the browser warns: the intermediate chain is missing. Add the chain to the server config.

Escalation trigger: Unexpected issuer (possible MITM); revoked cert; cert for wrong domain (phishing risk); cannot issue replacement.

Safe actions: Read cert details with openssl — read-only.

Dangerous actions: Accepting an untrusted cert as safe, disabling TLS verification.


Alert: Unauthorized Kubernetes RBAC / Privilege Escalation

Severity: P1

First command:

kubectl get clusterrolebindings,rolebindings -A -o yaml | grep -E "subjects|roleRef" | head -40
# Look for: unexpected users/groups bound to cluster-admin or high-privilege roles
What you're looking for: Unexpected principals (users, service accounts) bound to cluster-admin or roles with create/delete on sensitive resources.

Decision tree:

Is an unexpected service account or user bound to cluster-admin?
├── Yes → Who added this binding?
│         kubectl get clusterrolebinding <name> -o yaml | grep -E "annotations|labels"
│         Unknown origin? → Delete the binding AND escalate to security:
│         kubectl delete clusterrolebinding <name>
└── No → Is a service account with broad permissions compromised?
    ├── Yes → Rotate the service account token:
    │         kubectl delete secret <sa-token-secret> -n <ns>
    │         (legacy token secrets are recreated automatically; on K8s ≥ 1.24,
    │         pods use short-lived projected tokens, so restart the pods instead)
    └── No → Was a new RBAC binding added recently?
              Check the API server audit log for recent create calls on
              clusterrolebindings (RBAC changes do not generate regular events).
              Unauthorized change? → Delete and escalate.
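The "unexpected binding" check can be scripted against an allowlist so the scan is repeatable. A sketch assuming jq is installed; the allowlist regex and function name are placeholders:

```shell
# unexpected_admins: read `kubectl get clusterrolebindings -o json` on stdin
# and print cluster-admin subjects whose names do NOT match the allowlist regex.
unexpected_admins() {
  allow="$1"
  jq -r --arg ok "$allow" '
    .items[]
    | select(.roleRef.name == "cluster-admin")
    | .subjects[]?
    | select(.name | test($ok) | not)
    | .name'
}

# usage:
# kubectl get clusterrolebindings -o json | unexpected_admins 'system:masters|platform-admin'
```

Any name this prints is either missing from your allowlist (update it) or a finding (escalate).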

Escalation trigger: Any suspicious cluster-admin binding of unknown origin; service account token leaked externally; cannot determine if access was used.

Safe actions: Read RBAC bindings — read-only.

Dangerous actions: Delete RBAC bindings (may break services), rotate service account tokens.


Quick Reference

Most Useful Commands

# Check for exposed secrets in git
gh api repos/<org>/<repo>/secret-scanning/alerts

# Recent K8s API events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Who is bound to cluster-admin
kubectl get clusterrolebindings -o json | jq '.items[] | select(.roleRef.name=="cluster-admin") | {name:.metadata.name, subjects:.subjects}'

# List all service accounts (tokens are not shown; inspect secrets / projected tokens separately)
kubectl get serviceaccounts -A

# Scan running image for CVEs
trivy image <image>:<tag> --severity CRITICAL,HIGH

# Check certificate validity
echo | openssl s_client -connect <host>:443 2>/dev/null | openssl x509 -noout -dates -subject

# Collect audit events (preserve before rotation)
kubectl get events -A > /tmp/k8s-events-$(date +%Y%m%d-%H%M%S).txt

# Check recent secret changes in K8s
kubectl get events -A | grep -i secret

Escalation Contacts

Situation                     Team                          Channel
Any unauthorized access       Security team (immediately)   #security-incidents
Leaked production secret      Security + on-call lead       #security-incidents
Critical CVE (CVSS ≥ 9)       Security + app team           #security-incidents
Suspicious RBAC change        Security team                 #security-incidents
Possible MITM / cert fraud    Security team (immediately)   Direct page

Safe vs Dangerous Actions

Safe (do without asking)       Dangerous (get approval)
Read audit logs                Revoke credentials / access
Scan images for CVEs           Delete RBAC bindings
Read RBAC bindings             Rotate service account tokens
Preserve logs as evidence      Purge git history
Check cert details             Network isolation / firewall changes

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]