
Infrastructure Forensics - Street-Level Ops

What experienced ops engineers know about investigating security incidents before the IR team arrives.

Quick Diagnosis Commands

# Is someone else on this system right now?
w                                        # Who is logged in
who -a                                   # All login sessions
last -10                                 # Recent logins
ss -tlnp                                 # Listening services

# Suspicious processes
ps auxf                                  # Full process tree
ps aux --sort=-%cpu | head -20           # CPU hogs (crypto miner?)
ps aux --sort=-%mem | head -20           # Memory hogs

# Suspicious network activity
ss -anp | grep ESTABLISHED              # Active connections
ss -anp | grep -v "127.0.0.1\|::1"     # Non-local connections
lsof -i -P -n                           # All network-connected processes

# Recently modified files (last 24 hours)
find / -mtime -1 -type f -not -path "/proc/*" -not -path "/sys/*" \
  -not -path "/run/*" 2>/dev/null | head -50

# Check for unusual SUID binaries
find / -perm -4000 -type f 2>/dev/null

# Check for hidden directories (common attacker technique)
find / -name ".*" -type d -not -path "/proc/*" -not -path "/sys/*" \
  2>/dev/null | grep -v -E "/\.(ssh|cache|config)(/|$)"

# Verify system binary integrity
rpm -Va 2>/dev/null | grep -E "^..5"     # RHEL: file digest changed
debsums -c 2>/dev/null                    # Debian: changed files

Gotcha: The Crypto Miner You Can't Find

CPU usage is pegged at 100%. top shows nothing unusual — all processes look normal. But the system is sluggish and power consumption is high. The attacker replaced top and ps with versions that hide their mining process.

Fix: Don't trust the compromised system's tools. Use these methods instead:

# Check /proc directly — harder to hide from
ls -la /proc/[0-9]*/exe 2>/dev/null | sort -t/ -k3 -n
# Look for processes with deleted or suspicious exe links
# "(deleted)" next to the exe link = binary was deleted after launch (very suspicious)

# Use process accounting if enabled
lastcomm | head -30

# Check CPU usage from /proc/stat (can't be faked by userspace tools)
cat /proc/stat | head -1
# Compare: if /proc/stat shows high CPU but top shows low, tools are lying
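That comparison can be made concrete by taking two samples of the aggregate counters a couple of seconds apart (a bash sketch; runs as any user):

```shell
# Sketch: measure CPU busy % straight from /proc/stat over a short interval.
# Userspace rootkits can trojan ps/top, but not these kernel counters.
sample() { awk '/^cpu /{t=0; for(i=2;i<=NF;i++) t+=$i; print t, $5}' /proc/stat; }
read -r t1 i1 <<<"$(sample)"   # total jiffies, idle jiffies
sleep 2
read -r t2 i2 <<<"$(sample)"
busy=$(( 100 * ( (t2 - t1) - (i2 - i1) ) / (t2 - t1) ))
echo "CPU busy over sample: ${busy}%"
```

If this number is high while top claims the box is idle, believe /proc/stat.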

# Check for hidden kernel modules (LKM rootkit)
cat /proc/modules | wc -l
lsmod | wc -l
# lsmod prints a header line, so expect exactly one more line from lsmod;
# any other gap means a module is hiding from one of the two lists
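Since lsmod itself reads /proc/modules, a module hidden there fools both commands; /sys/module is a separate kernel view worth cross-checking (a sketch; the /tmp paths are arbitrary scratch files):

```shell
# Cross-check the loadable-module list from two separate kernel interfaces.
awk '{print $1}' /proc/modules | sort > /tmp/mods_proc
for s in /sys/module/*/initstate; do        # initstate exists only for loaded
  [ -e "$s" ] && basename "$(dirname "$s")" # loadable modules, not builtins
done | sort > /tmp/mods_sys
diff /tmp/mods_proc /tmp/mods_sys || echo "MISMATCH -- a module hides in one list"
```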

# Network approach: check traffic from outside the host
# On your monitoring server or network tap:
# tcpdump on the switch port for this server
# Look for connections to known mining pools

Remember the cardinal rule of forensics: collect evidence BEFORE you remediate. Once you restart a service, reboot a machine, or kill a process, volatile evidence (memory, network connections, running processes) is gone forever. Write it down, screenshot it, copy it off-box.
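A minimal capture script along those lines might look like this; the /tmp path and the "collector" host are placeholders, not your real destinations:

```shell
# Minimal volatile-evidence capture -- run BEFORE any remediation.
# The /tmp path and "collector" host are placeholders for illustration.
EVID="/tmp/evidence-$(hostname)-$(date +%Y%m%d%H%M%S)"
mkdir -p "$EVID"
date             > "$EVID/collected_at.txt"
w                > "$EVID/logged_in.txt" 2>&1
ps auxf          > "$EVID/processes.txt" 2>&1
ss -anp          > "$EVID/sockets.txt"   2>&1
cat /proc/mounts > "$EVID/mounts.txt"    2>&1
sha256sum "$EVID"/* > "$EVID/SHA256SUMS"   # hashes support chain of custody
echo "now copy it off-box, e.g.: scp -r $EVID collector:/cases/"
```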

Gotcha: Attacker Cleaned Up /var/log

You check auth.log. It's suspiciously empty or shows a gap — two hours of missing log entries. The attacker deleted or truncated the logs to cover their tracks.

Fix: Check alternate log sources:

# journald binary logs (harder to selectively edit)
journalctl --since "7 days ago" | grep -i "ssh\|sudo\|failed\|accepted"

# Kernel audit log (if auditd was running)
ausearch -ts recent
ausearch -m USER_LOGIN --start "03/10/2024"

# Last/wtmp (binary log, harder to edit cleanly)
last -f /var/log/wtmp
utmpdump /var/log/wtmp

# Remote syslog (if configured — this is your safety net)
# Check your central syslog/SIEM for this host's logs

# Check for log rotation artifacts
ls -la /var/log/auth.log*
zgrep "Accepted\|Failed" /var/log/auth.log.*.gz

# Compare log file timestamps with expected rotation schedule
stat /var/log/auth.log
# A recent modification time on a nearly empty file suggests tampering
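If you run rsyslog, one quick way to confirm whether that remote safety net even exists (classic rsyslog forwarding syntax assumed: "@host" is UDP, "@@host" is TCP):

```shell
# Quick check: is anything forwarding logs off-box?
fwd=$(grep -hsE '^[^#]*@' /etc/rsyslog.conf /etc/rsyslog.d/*.conf)
msg="${fwd:-no remote forwarding found -- local logs are the only copy}"
echo "$msg"
```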

Gotcha: The Unauthorized SSH Key You Almost Missed

You audit authorized_keys in /home/* and /root. All clean. But the attacker changed the AuthorizedKeysFile directive in sshd_config to point at an alternate location and planted their key there.

Fix: Check the SSH daemon configuration, not just the default locations:

# What authorized_keys locations does sshd actually use?
grep -i "AuthorizedKeysFile" /etc/ssh/sshd_config
# Default: .ssh/authorized_keys .ssh/authorized_keys2

# Check for AuthorizedKeysCommand (dynamic key lookup)
grep -i "AuthorizedKeysCommand" /etc/ssh/sshd_config
# An attacker could point this to a script that always returns their key

# Check for modifications to PAM config (alternate auth)
grep -r "pam_exec\|pam_script" /etc/pam.d/

# Check for SSH daemon running on non-standard ports
ss -tlnp | grep sshd
# Multiple sshd instances? One might be the attacker's backdoor

# Check if sshd binary itself was replaced
sha256sum /usr/sbin/sshd
rpm -Vf /usr/sbin/sshd    # or dpkg --verify openssh-server
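Note that grep only sees the one file it is pointed at; `sshd -T` prints the daemon's effective configuration after Include files and compiled-in defaults are resolved (usually requires root; the binary path is assumed):

```shell
# Dump sshd's EFFECTIVE config and pull out the key-lookup directives
eff=$(/usr/sbin/sshd -T 2>/dev/null \
      | grep -iE 'authorizedkeysfile|authorizedkeyscommand|permitrootlogin')
msg="${eff:-could not query sshd (not installed here, or run as root?)}"
echo "$msg"
```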

Gotcha: The Cron Job That Only Runs at 3am

Attacker installed a cron job that phones home once per day at 3am. It runs for 30 seconds, downloads instructions, executes them, and exits. During the day, the system looks completely clean.

Fix: Check every cron location, not just the obvious ones:

# All the places cron jobs can hide:
cat /etc/crontab
ls -la /etc/cron.d/
ls -la /etc/cron.daily/
ls -la /etc/cron.hourly/

# Per-user crontabs (stored in /var/spool/cron/)
ls -la /var/spool/cron/crontabs/    # Debian
ls -la /var/spool/cron/             # RHEL

# Systemd timers (the modern cron)
systemctl list-timers --all

# at/batch jobs (one-time scheduled tasks)
atq
for job in $(atq | awk '{print $1}'); do at -c "$job" 2>/dev/null; done

# Check for unusual anacron jobs
cat /etc/anacrontab

# Look for recently created cron files
find /etc/cron* /var/spool/cron -mtime -30 -ls 2>/dev/null
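Directory listings can miss a spool location you forgot; asking the crontab tool itself for every account's entries covers them all (run as root; a sketch):

```shell
# Dump each user's crontab via the crontab tool (run as root)
n=0
while IFS=: read -r user _; do
  tab=$(crontab -l -u "$user" 2>/dev/null) && printf '%s\n' "$tab" | sed "s|^|$user: |"
  n=$((n + 1))
done < /etc/passwd
echo "checked $n accounts"
```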

Gotcha: Compromised Container, Not Host

The host looks clean but a container is compromised. The attacker is operating inside a container and you're only examining the host. Container processes show as normal processes on the host — you need to know which container owns them.

Fix: Correlate host processes to containers:

# Find which container a host PID belongs to
# Get the cgroup of a suspicious process
cat /proc/<pid>/cgroup
# Output shows the container ID

# Or: find all processes in containers
for pid in /proc/[0-9]*/; do
  cgroup=$(cat "${pid}cgroup" 2>/dev/null | head -1)
  if echo "$cgroup" | grep -q "docker\|containerd\|crio"; then
    echo "PID $(basename $pid): $cgroup"
  fi
done

# Inspect the suspicious container
docker inspect <container-id> | jq '.[0].Config.Cmd, .[0].Config.Entrypoint'
docker logs <container-id> --since "24h"
docker diff <container-id>    # Files changed inside the container

# In Kubernetes:
crictl inspect <container-id>
crictl logs <container-id>
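To examine the container's network activity without trusting its binaries, you can run the host's tools inside just the container's network namespace via nsenter (the container name below is hypothetical):

```shell
# Run the HOST's ss binary inside only the container's network namespace,
# so you see the container's sockets without trusting its (possibly
# trojaned) binaries. "suspicious" is a hypothetical container name.
CID="suspicious"
PID=$(docker inspect --format '{{.State.Pid}}' "$CID" 2>/dev/null)
if [ -n "$PID" ] && [ "$PID" != "0" ]; then
  nsenter -t "$PID" -n ss -tlnp
  result="inspected netns of PID $PID"
else
  result="container '$CID' not found (is docker running?)"
fi
echo "$result"
```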

Pattern: The 30-Minute First Responder Checklist

When you suspect a compromise, work through this in order:

Minutes 0-5: OBSERVE (don't touch anything invasive)
├── w / who           (who is logged in right now?)
├── ps auxf           (full process tree)
├── ss -anp           (all network connections)
├── uptime            (was the system rebooted recently?)
└── dmesg | tail -50  (kernel messages, OOM kills, module loads)

Minutes 5-15: CAPTURE (collect volatile evidence)
├── Run evidence capture script (see primer)
├── Screenshot dashboards showing anomalies
├── Copy evidence off-system to safe location
├── Hash everything: sha256sum evidence/*
└── Start your written incident log

Minutes 15-20: ASSESS
├── Is this a real compromise or a false positive?
├── What data does this server have access to?
├── What other systems can it reach?
├── Is the attack ongoing or historical?
└── What's the blast radius?

Minutes 20-25: CONTAIN
├── Option A: Network isolation (preferred)
   - Security group change (cloud)
   - iptables drop all except your SSH (on-prem)
   - VLAN change to quarantine network
├── Option B: Disable compromised accounts
├── Option C: Revoke compromised credentials/keys
└── DO NOT: power off, reboot, or wipe

Minutes 25-30: NOTIFY
├── Security/IR team (with your evidence and assessment)
├── Management (scope and impact estimate)
├── Relevant on-call (if service is impacted)
└── Hand off to IR team for deep investigation

Pattern: Baseline Comparison for Fast Anomaly Detection

You can't find anomalies without knowing what normal looks like:

# Create a baseline snapshot on a known-good system
# Run this monthly or after each infrastructure change

BASELINE="/etc/security/baseline-$(date +%Y%m%d)"
mkdir -p "$BASELINE"

# Package list
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort > "$BASELINE/packages.txt"
# or: dpkg -l | sort > "$BASELINE/packages.txt"

# SUID files
find / -perm -4000 -type f 2>/dev/null | sort > "$BASELINE/suid_files.txt"

# Listening ports
ss -tlnp | sort > "$BASELINE/listeners.txt"

# Enabled services
systemctl list-unit-files --state=enabled | sort > "$BASELINE/services.txt"

# Cron jobs
cat /etc/crontab /etc/cron.d/* > "$BASELINE/cron.txt" 2>/dev/null

# User accounts
cat /etc/passwd | sort > "$BASELINE/users.txt"
cat /etc/group | sort > "$BASELINE/groups.txt"

# During investigation, compare current state to baseline:
diff "$BASELINE/packages.txt" <(rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort)
diff "$BASELINE/suid_files.txt" <(find / -perm -4000 -type f 2>/dev/null | sort)
diff "$BASELINE/listeners.txt" <(ss -tlnp | sort)

Emergency: Active Attacker on the System

You see an unknown SSH session active right now. Someone is on the system.

1. DO NOT tip them off. Don't kill their session.
   They may have a persistence mechanism that triggers if disconnected.

2. Capture what they're doing:
   # Watch their terminal in real time (if you have root)
   # Find their PTS device:
   w
   # Watch their process:
   strace -fp <their-shell-pid> -e trace=write 2>&1 | strings

3. Capture network traffic:
   tcpdump -i any -w /tmp/capture.pcap host <their-ip> &

4. Notify security team IMMEDIATELY
   - Use out-of-band communication (phone, not Slack on the same infra)
   - Provide: attacker's source IP, current session, observed activity

5. If instructed to contain:
   # Block their IP at the network level (not on the host — they'll see it)
   # Or: change the security group / firewall ACL at the network perimeter
   # This drops their connection without running commands on the compromised host

6. If you MUST cut them off from the host:
   iptables -I INPUT -s <attacker-ip> -j DROP
   # Then kill their session:
   pkill -9 -t <their-pts>

Emergency: Ransomware Detected

Files are being encrypted. You see files with unusual extensions (.encrypted, .locked) and ransom notes appearing.

1. IMMEDIATELY isolate the network
   - Pull the network cable if on-prem
   - Disable the network interface: ip link set eth0 down
   - Change security group to deny all if cloud
   - This stops lateral movement and exfiltration

2. DO NOT shut down:
   - Encryption keys may be in memory
   - Forensics team can potentially recover them

3. Identify the scope:
   - Which filesystems are affected?
   - Is encryption still in progress or complete?
   - Which other systems can this server reach?

4. Check for the encryption process:
   ps aux | grep -i encrypt
   lsof | grep -E "\.(encrypted|locked|crypto)"
   # Suspend (don't kill) the encryption process IF it is still running:
   kill -STOP <pid>    # STOP freezes the process and preserves its memory state

5. Notify:
   - Security team / CISO
   - Legal (potential data breach)
   - Management (business continuity decisions)
   - DO NOT contact the attacker or pay ransom without legal guidance

6. Recovery (after IR team clears):
   - Restore from backups (verify backups aren't encrypted too)
   - Rebuild the system from scratch
   - Rotate ALL credentials the system had access to