Ops-Focused Security Basics - Street Ops

What experienced ops engineers know about security in practice.

Incident Runbooks

Rotating Compromised Credentials

1. Confirm the compromise:
   - Was the secret exposed in a Git commit, log file, or error message?
   - Was it found by a scanning tool (GitHub secret scanning, trufflehog)?
   - Was there suspicious API activity using the credential?

2. IMMEDIATELY rotate the credential:
   - Do NOT wait for investigation to complete
   - Do NOT wait for a change request or PR review
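   - Generate a new credential in the provider's console or CLI
   - Explicitly revoke or deactivate the old one; many providers (AWS IAM access keys, for example) leave it valid until you do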

3. Update all legitimate consumers:
   - CI/CD pipeline secrets (GitHub Secrets, GitLab CI Variables)
   - Application environment variables
   - Vault/Secrets Manager entries
   - Configuration files (encrypted)
   - Verify each consumer works with the new credential

4. Investigate scope:
   - When was the credential exposed?
   - Was it used by an attacker? Check API audit logs:
     * AWS: CloudTrail
     * GitHub: Audit log
     * Cloud provider console: activity log
   - What resources did the credential have access to?
   - Were any resources modified or data exfiltrated?

5. Prevent recurrence:
   - Remove the secret from Git history (git filter-repo)
   - Add pre-commit hooks to detect secrets
   - Switch to short-lived credentials (OIDC, STS) where possible
   - Review: why was a long-lived credential used here?

6. Document:
   - What was compromised
   - Timeline of exposure and rotation
   - Impact assessment
   - Preventive measures taken
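
For one concrete case, an AWS IAM user access key, step 2 might look like the sketch below. It is a minimal illustration rather than a general procedure: `rotate_iam_access_key` is a hypothetical helper name, and it assumes the AWS CLI is installed and authenticated with permission to manage that user's keys. On AWS, creating a new key does not disable the old one, so the sketch deactivates the leaked key explicitly and does that first.

```shell
# Hypothetical helper: deactivate a leaked IAM access key, then issue a replacement.
# Usage: rotate_iam_access_key <iam-user-name> <leaked-access-key-id>
rotate_iam_access_key() (
  user=$1
  old_key_id=$2

  # Deactivate first: the leaked key stops working even if the next step fails.
  # Inactive (rather than deleted) keeps the key ID visible for the investigation.
  aws iam update-access-key \
    --user-name "$user" --access-key-id "$old_key_id" --status Inactive || exit 1

  # Issue the replacement. The output contains the new secret, so send it
  # straight to your secrets manager rather than to a terminal or log.
  aws iam create-access-key --user-name "$user" --output json
)
```

Deactivating first means legitimate consumers fail until step 3 is done, which is usually the right trade during a confirmed compromise. For step 4, `aws cloudtrail lookup-events --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=<key-id>` lists recent management events made with the leaked key.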

Auditing SSH and Sudo Access

1. Who can SSH into production?
   # Check authorized_keys for all users
   for user_home in /home/*; do
     user=$(basename "$user_home")
     if [ -f "$user_home/.ssh/authorized_keys" ]; then
       echo "=== $user ==="
       cat "$user_home/.ssh/authorized_keys"
     fi
   done

   # Check root authorized_keys
   cat /root/.ssh/authorized_keys 2>/dev/null

   # Check sshd_config for access restrictions
   grep -E "^(AllowUsers|AllowGroups|DenyUsers|DenyGroups)" /etc/ssh/sshd_config

2. Who has sudo access?
   # Check sudoers
   cat /etc/sudoers
   cat /etc/sudoers.d/*

   # Find users in sudo/wheel group
   getent group sudo 2>/dev/null || getent group wheel 2>/dev/null

   # Check for NOPASSWD entries (risky)
   grep -r "NOPASSWD" /etc/sudoers /etc/sudoers.d/

3. Recent sudo usage:
   # Check auth log
   grep "sudo:" /var/log/auth.log | tail -50          # Debian/Ubuntu
   grep "sudo:" /var/log/secure | tail -50              # RHEL/CentOS
   journalctl _COMM=sudo --since "7 days ago"

4. Recent SSH logins:
   last -20                            # Recent successful logins
   lastb -20                           # Recent failed logins
   journalctl SYSLOG_IDENTIFIER=sshd --since "7 days ago" | grep "Accepted"

5. Check for unauthorized keys:
   # Compare authorized_keys against your key management system
   # Any key that isn't in your inventory is suspicious
   # Look for keys with no comment (who added it?)
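
The comparison in step 5 can be partly automated. The sketch below is one way to do it, under stated assumptions: `find_unknown_keys` is a hypothetical helper, and the inventory is assumed to be a plain text file with one approved base64 key blob per line (most key management systems will need an export step to produce that). It does not handle option strings that contain spaces, such as a quoted `command="..."` with arguments.

```shell
# Hypothetical helper: print keys in an authorized_keys file whose base64 blob
# is not listed in an inventory file (assumed format: one approved blob per line).
# Usage: find_unknown_keys <authorized_keys-file> <inventory-file>
find_unknown_keys() (
  auth_file=$1
  inventory=$2
  awk -v inv="$inventory" '
    BEGIN { while ((getline line < inv) > 0) approved[line] = 1 }
    /^[ \t]*(#|$)/ { next }   # skip comments and blank lines
    {
      for (i = 1; i < NF; i++) {
        # The key type marks where the key starts; options may precede it.
        if ($i ~ /^(ssh-|ecdsa-|sk-)/) {
          blob = $(i + 1)
          comment = (i + 2 <= NF) ? $(i + 2) : "(no comment)"
          if (!(blob in approved)) print comment "\t" substr(blob, 1, 20) "..."
          next
        }
      }
    }
  ' "$auth_file"
)
```

Each output line is the key's comment and the first 20 characters of its blob, which is enough to locate the entry without printing the whole key. Running it inside the `for user_home` loop from step 1 covers every user.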

Scanning for Vulnerabilities

1. System packages:
   # Check for available security updates
   # Debian/Ubuntu:
   apt update && apt list --upgradable 2>/dev/null | grep -i security
   unattended-upgrade --dry-run

   # RHEL/CentOS:
   dnf check-update --security
   dnf updateinfo list security

2. Container images:
   # Scan with Trivy
   trivy image myapp:latest
   trivy image --severity CRITICAL --exit-code 1 myapp:latest

   # Scan all running containers
   for img in $(docker ps --format '{{.Image}}' | sort -u); do
     echo "=== $img ==="
     trivy image --severity HIGH,CRITICAL "$img" 2>/dev/null
   done

3. Infrastructure as code:
   # Terraform security scanning
   trivy config ./terraform/
   checkov -d ./terraform/
   tfsec ./terraform/

   # Common findings:
   # - S3 buckets without encryption
   # - Security groups open to 0.0.0.0/0
   # - Unencrypted RDS instances
   # - Missing logging/auditing configuration

4. SUID/SGID files (privilege escalation risk):
   find / -type f \( -perm -4000 -o -perm -2000 \) -exec ls -la {} \; 2>/dev/null
   # Review: are all SUID files legitimate system binaries?
   # Unexpected SUID files are a red flag

5. World-writable files and directories:
   find / -type f -perm -002 -not -path "/proc/*" -not -path "/sys/*" 2>/dev/null
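   # Directories: the sticky bit (as on /tmp) is expected, so flag only those without it
   find / -type d -perm -0002 ! -perm -1000 -not -path "/proc/*" -not -path "/sys/*" 2>/dev/null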
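
Reviewing the full SUID/SGID list by eye on every audit is tedious, so one option is to record a baseline once and report only additions. The helper names below (`list_suid_files`, `new_suid_files`) and the baseline file are hypothetical; this is a sketch of the idea, not a replacement for a file-integrity tool.

```shell
# Hypothetical helpers: record SUID/SGID files once, then report additions.
# Usage: list_suid_files [root-dir]
list_suid_files() (
  find "${1:-/}" -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null | sort
)

# Usage: new_suid_files <baseline-file> [root-dir]
new_suid_files() (
  baseline=$1
  current=$(mktemp) || exit 1
  list_suid_files "${2:-/}" > "$current"
  # comm -13 prints lines that appear only in the second (current) list.
  sort "$baseline" | comm -13 - "$current"
  rm -f "$current"
)
```

A typical flow would be to run `list_suid_files > suid-baseline.txt` on a freshly built host, keep that file somewhere the host cannot modify, and run `new_suid_files suid-baseline.txt` during audits. Any output is a file that gained SUID or SGID since the baseline.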

Incident Response Basics

1. Detection:
   - Alert from monitoring (unusual traffic, error rate spike)
   - Alert from security tool (failed logins, suspicious process)
   - Report from user or external party

2. Triage (first 15 minutes):
   - Is this real or a false positive?
   - What's the scope? One server? One account? Entire environment?
   - Is the attack ongoing or historical?
   - Who needs to be notified?

3. Containment:
   - Isolate affected systems (security group change, network ACL)
   - Disable compromised accounts
   - Rotate affected credentials
   - DO NOT shut down or destroy - preserve evidence

4. Investigation:
   - Timeline: when did the compromise start?
   - Entry point: how did the attacker get in?
   - Lateral movement: what else did they access?
   - Data exfiltration: was data stolen?

   Key evidence sources:
   - SSH auth logs: /var/log/auth.log or /var/log/secure
   - System logs: journalctl, /var/log/syslog
   - Cloud audit logs: CloudTrail, Activity Log
   - Network logs: VPC flow logs, firewall logs
   - Application logs
   - Command history: ~/.bash_history (may be deleted by attacker)

5. Eradication:
   - Patch the vulnerability
   - Remove attacker's access (backdoor accounts, SSH keys, cron jobs)
   - Rebuild compromised systems from clean images
   - Rotate ALL credentials the compromised system had access to

6. Recovery:
   - Restore from clean backups if data was modified
   - Gradually bring systems back online with monitoring
   - Verify integrity of restored systems

7. Post-incident:
   - Write an incident report
   - Identify preventive measures
   - Implement those measures
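
Preserving evidence during containment usually starts with a snapshot of volatile state before anything is changed. The sketch below shows one way to do that, under assumptions: `collect_evidence` and the output file names are hypothetical, the command list is a starting point rather than a complete forensic set, and the output directory should live on storage the attacker cannot reach.

```shell
# Hypothetical helper: snapshot volatile state into a timestamped directory
# and checksum the results so later tampering is detectable.
# Usage: collect_evidence <parent-dir>   (prints the directory it created)
collect_evidence() (
  out_dir=$1/evidence-$(uname -n)-$(date -u +%Y%m%dT%H%M%SZ)
  mkdir -p "$out_dir" || exit 1

  # Run a command only if it exists; keep stdout and stderr together.
  capture() {
    name=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
      "$@" > "$out_dir/$name.txt" 2>&1
    fi
  }

  capture system    uname -a
  capture processes ps -ef
  capture sockets   ss -tanp
  capture sessions  who -a
  capture logins    last -n 50
  capture cron      crontab -l

  (cd "$out_dir" && sha256sum -- *.txt > SHA256SUMS)
  echo "$out_dir"
)
```

Run it as root so `ss -p` can show owning processes, then copy the directory off the host. `sha256sum -c SHA256SUMS` inside the directory verifies the files later.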

Gotchas & War Stories

The default security group that allowed everything
New cloud account, default security group allows all inbound traffic. Someone launches an instance, doesn't configure a security group, gets the default. Database port exposed to the internet. Crypto miners on the box within hours. Prevention: modify the default security group to deny all inbound. Create explicit security groups for each use case.
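
On AWS, a quick way to see how exposed an account is would be to list every security group with an inbound rule open to the whole internet. This is a sketch assuming the AWS CLI is installed and authenticated; `list_world_open_security_groups` is a hypothetical helper name. It covers IPv4 rules in the currently configured region only.

```shell
# Hypothetical helper: list security groups that have an inbound rule
# allowing 0.0.0.0/0 (IPv4 only, current region only).
list_world_open_security_groups() (
  aws ec2 describe-security-groups \
    --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query 'SecurityGroups[].[GroupId,GroupName]' \
    --output text
)
```

A group named `default` in the output is the case described above. Some entries, such as a load balancer listening on 443, may be intentional and only need to be confirmed.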

The service account with admin permissions
"Just give it admin access, we'll scope it down later." Later never comes. The CI/CD pipeline now has full admin access to the AWS account. One compromised dependency in the build, and the attacker owns everything. Prevention: start with zero permissions and add only what's needed. Use IAM Access Analyzer to identify unused permissions.

The expired certificate at 2am
Certificate expired, HTTPS is broken, users see scary browser warnings. Nobody tracked the expiration date. Prevention: automate with certbot/Let's Encrypt, set up monitoring that alerts 30 days before expiration:

# Check cert expiry from command line
echo | openssl s_client -connect mysite.com:443 -servername mysite.com 2>/dev/null | openssl x509 -noout -enddate
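
Printing the end date still leaves a human to do the arithmetic. For an alert, `openssl x509 -checkend` can answer the "within 30 days" question directly. The helper below is a minimal sketch; the name `cert_expires_within` is hypothetical, and it works on a PEM file rather than a live connection.

```shell
# Hypothetical helper: succeed (exit 0) when a PEM certificate expires within N days.
# Usage: cert_expires_within <cert.pem> <days>
cert_expires_within() (
  cert_file=$1
  days=$2
  # -checkend exits 0 if the certificate is still valid after that many seconds,
  # so the result is negated. An unreadable file also lands in the "alert"
  # branch, which fails safe.
  ! openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" >/dev/null 2>&1
)
```

Used in a cron job or monitoring check, it reads as `if cert_expires_within site.pem 30; then echo "renew soon"; fi`. To check a live site, save the certificate first by ending the `s_client` pipeline above with `openssl x509 > site.pem` instead of `-noout -enddate`.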

SSH key sprawl
Over time, authorized_keys files accumulate keys from former employees, contractors, and "temporary" access grants that were never revoked. Every one of those keys is a potential entry point. Prevention: centralize SSH key management (LDAP, SSO, short-lived certificates with ssh-keygen -s ca_key). Audit authorized_keys quarterly.
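
The short-lived certificate option can be sketched with stock OpenSSH. The helper name `issue_short_lived_cert`, the 8-hour default, and the key ID format are assumptions for illustration. A real setup also needs the CA private key kept somewhere well protected and servers configured to trust the CA's public key through `TrustedUserCAKeys` in sshd_config.

```shell
# Hypothetical helper: sign a user's public key with an SSH CA for a limited time.
# Usage: issue_short_lived_cert <ca-private-key> <user-key.pub> <principal> [validity]
issue_short_lived_cert() (
  ca_key=$1
  user_pub=$2
  principal=$3
  validity=${4:-+8h}   # assumption: 8 hours; choose what fits your shifts

  # -I sets the key ID that sshd logs at login, -n the login name(s) allowed,
  # -V the validity window. The certificate is written next to the public
  # key as <name>-cert.pub.
  ssh-keygen -q -s "$ca_key" -I "$principal@$(date -u +%Y%m%dT%H%M%SZ)" \
    -n "$principal" -V "$validity" "$user_pub"
)
```

Because the certificate expires on its own, there is nothing to remove from authorized_keys when someone leaves. `ssh-keygen -L -f <name>-cert.pub` shows the principals and validity of an issued certificate.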

The /tmp exploit
Attacker writes an exploit to /tmp (world-writable), makes it executable, runs it for privilege escalation. Prevention: mount /tmp with noexec,nosuid,nodev in /etc/fstab:

tmpfs /tmp tmpfs defaults,noexec,nosuid,nodev 0 0
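
An fstab entry only takes effect on the next mount, so it is worth checking what is actually in force. The sketch below separates the check from the lookup so it can be reused for other mount points; `missing_mount_opts` is a hypothetical helper name.

```shell
# Hypothetical helper: print each wanted option that is absent from a
# comma-separated mount option string.
# Usage: missing_mount_opts <option-string> <wanted-option>...
missing_mount_opts() (
  opts=$1
  shift
  for want in "$@"; do
    case ",$opts," in
      *",$want,"*) ;;                 # present, nothing to report
      *) printf '%s\n' "$want" ;;     # absent
    esac
  done
)
```

On a live host, feed it the current options with `missing_mount_opts "$(findmnt -no OPTIONS --target /tmp)" noexec nosuid nodev`. Empty output means all three options are active.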

Unattended upgrades that broke production
Automatic security updates are good in theory, but an unattended kernel upgrade that requires a reboot, or a library update that breaks your app, can cause an outage. Balance: enable unattended upgrades for security patches only, exclude kernel updates, and test in staging first.

Security Audit Checklist

# Quick security posture check
echo "=== SSH Config ==="
grep -E "^(PasswordAuth|PermitRoot|AllowGroups)" /etc/ssh/sshd_config

echo "=== Users with shells ==="
grep -v "nologin\|false" /etc/passwd | grep -v "^#"

echo "=== Sudoers ==="
grep -v "^#\|^$" /etc/sudoers 2>/dev/null
cat /etc/sudoers.d/* 2>/dev/null

echo "=== Listening ports ==="
ss -tlnp

echo "=== Firewall rules ==="
iptables -L -n 2>/dev/null || nft list ruleset 2>/dev/null || firewall-cmd --list-all 2>/dev/null

echo "=== Failed SSH attempts (last 24h) ==="
journalctl SYSLOG_IDENTIFIER=sshd --since "24 hours ago" | grep -c "Failed"

echo "=== SUID files ==="
find / -type f -perm -4000 2>/dev/null | wc -l

echo "=== Pending security updates ==="
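# grep -c exits non-zero on a count of 0, so pick the package manager explicitly
if command -v apt >/dev/null 2>&1; then
  apt list --upgradable 2>/dev/null | grep -c security
else
  dnf check-update --security 2>/dev/null | tail -1
fi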

Quick Reference

Cheatsheet: Security