Portal | Level: L2: Operations | Topics: Infrastructure Forensics, Audit Logging, Linux Hardening | Domain: Security
Infrastructure Forensics - Primer¶
Why This Matters¶
The ops engineer is almost always first on scene when something looks wrong. A server is behaving strangely. Network traffic is going places it shouldn't. An alert fires for a process that shouldn't exist. Before the security team arrives — if there even is a security team — you need to know what to do and what not to do. Infrastructure forensics for ops people isn't about being a forensic analyst. It's about preserving evidence, making the right initial assessment, and not accidentally destroying the proof of what happened.
Analogy: Think of a compromised server like a crime scene. The first officer on scene does not dust for fingerprints -- they secure the perimeter and preserve evidence. Your job as the first responder is the same: contain, preserve, document, hand off. Doing too much is as dangerous as doing too little.
In the military, you learn that the person who discovers a security incident has a responsibility to secure the scene before specialists arrive. The same applies in infrastructure. Your actions in the first 30 minutes determine whether the incident response team has evidence to work with or a wiped-clean system with no trail.
Core Concepts¶
1. The Ops Engineer's Role in Incident Response¶
You're not the investigator. You're the first responder. Your job is to:
┌─────────────────────────────────────────────────────┐
│ First Responder Responsibilities │
│ │
│ DO: │
│ ├── Detect and confirm the anomaly │
│ ├── Preserve volatile evidence (RAM, connections) │
│ ├── Document what you see (screenshots, logs) │
│ ├── Contain the damage (isolate, don't destroy) │
│ ├── Notify the security/IR team │
│ └── Maintain chain of custody │
│ │
│ DO NOT: │
│ ├── Reboot the system (destroys volatile data) │
│ ├── Kill suspicious processes (destroys evidence) │
│ ├── Delete suspicious files (you'll need them) │
│ ├── Run unknown cleanup scripts │
│ ├── Log in with shared credentials │
│ └── Discuss the incident on public channels │
└─────────────────────────────────────────────────────┘
2. Checking for Rootkits¶
A rootkit modifies the operating system to hide the attacker's presence. Compromised systems lie to you — ps might not show the malicious process, ls might not show the backdoor file.
# rkhunter (Rootkit Hunter)
apt install rkhunter # or dnf install rkhunter
rkhunter --update # Update signatures
rkhunter --check # Full system scan
# What rkhunter checks:
# - Known rootkit signatures
# - File property changes (permissions, ownership, size)
# - Hidden files and directories
# - Suspicious kernel modules
# - Network interfaces in promiscuous mode
# - Suspicious startup files
# chkrootkit
apt install chkrootkit
chkrootkit # Quick scan
# AIDE (Advanced Intrusion Detection Environment)
# File integrity monitoring — detects unauthorized changes
apt install aide
aide --init # Create baseline database
cp /var/lib/aide/aide.db.new /var/lib/aide/aide.db # Set baseline
aide --check # Compare current state to baseline
# AIDE reports:
# Added files (new files that shouldn't be there)
# Removed files (files that were deleted)
# Changed files (modified binaries, configs, permissions)
Remember: Mnemonic for the forensic first-response order: "Detect, Preserve, Document, Contain, Notify" -- DPDCN. Skip any step and the investigation suffers. The most common mistake is jumping straight to "contain" (killing processes, rebooting) and destroying volatile evidence.
Important: If the system is already compromised, local tools may be compromised too. If possible, run these from a known-good live USB or compare against known-good binaries:
# Compare binary against package manager's known-good version
rpm -V openssh-server # RHEL: verify package file integrity
dpkg --verify openssh-server # Debian: verify package files
debsums openssh-server # Debian: checksum verification
# If a system binary was modified:
# rpm -V flag positions: SM5DLUGTP = size, mode, md5 digest, device, link, user, group, mtime, capabilities
rpm -Vf /usr/sbin/sshd
# Output like "S.5....T." means size, MD5 digest, and mtime changed = COMPROMISED
3. Auditing SSH Access¶
SSH is the most common entry point for unauthorized access:
# Who has logged in recently?
last -20 # Last 20 successful logins
lastb -20 # Last 20 failed login attempts
w # Currently logged-in users
# Authentication log analysis
# Debian/Ubuntu:
grep "Accepted" /var/log/auth.log | tail -30
grep "Failed" /var/log/auth.log | tail -30
# RHEL/CentOS:
grep "Accepted" /var/log/secure | tail -30
grep "Failed" /var/log/secure | tail -30
# systemd journal:
journalctl _SYSLOG_IDENTIFIER=sshd --since "7 days ago" | grep "Accepted"
journalctl _SYSLOG_IDENTIFIER=sshd --since "7 days ago" | grep "Failed"
# Check for unusual SSH patterns:
# 1. Logins from unexpected IP addresses
# 2. Logins at unusual hours
# 3. Logins as root (should be disabled)
# 4. Logins using password (should be key-only)
# 5. Successful login after many failures (brute force succeeded)
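The last pattern (a success after many failures) can be checked mechanically rather than by eye. A minimal sketch, assuming the Debian-style auth.log path; the threshold of 5 failures and the `scan_bruteforce` name are arbitrary choices, not a standard tool:

```shell
# Flag source IPs that eventually logged in after repeated failures.
# AUTH_LOG path and the threshold of 5 are assumptions; adjust for your distro.
AUTH_LOG="${AUTH_LOG:-/var/log/auth.log}"

scan_bruteforce() {
    awk '
        # Extract the IP that follows the word "from" on an sshd log line
        function src_ip(   i) { for (i = 1; i < NF; i++) if ($i == "from") return $(i + 1); return "" }
        /sshd.*Failed password/ { fails[src_ip()]++ }
        /sshd.*Accepted/ {
            ip = src_ip()
            if (fails[ip] >= 5)
                print "possible brute-force success from " ip " (" fails[ip] " prior failures)"
        }
    ' "$1"
}

if [ -r "$AUTH_LOG" ]; then
    scan_bruteforce "$AUTH_LOG"
fi
```

Any hit from this sketch is a starting point, not a verdict: correlate the flagged IP against `last` output and your key inventory before escalating.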
# Audit all authorized_keys files
find / -name "authorized_keys" -exec echo "=== {} ===" \; -exec cat {} \; 2>/dev/null
# Look for SSH keys with no comments (who added them?)
# Look for keys you don't recognize
# Compare against your key management inventory
4. Finding Unauthorized Cron Jobs and Services¶
Attackers establish persistence through scheduled tasks and services:
# Check ALL cron locations
# Per-user crontabs:
for user in $(cut -d: -f1 /etc/passwd); do
    echo "=== $user ==="; crontab -l -u "$user" 2>/dev/null
done
# System cron directories:
ls -la /etc/cron.d/
ls -la /etc/cron.daily/
ls -la /etc/cron.hourly/
ls -la /etc/cron.weekly/
ls -la /etc/cron.monthly/
cat /etc/crontab
# Systemd timers (modern cron replacement):
systemctl list-timers --all
# Check for unauthorized systemd services:
systemctl list-units --type=service --state=running
# Look for services you don't recognize
# Compare against a known-good baseline
# Check for services enabled at boot:
systemctl list-unit-files --state=enabled
# Check systemd service files for suspicious ExecStart:
find /etc/systemd/system/ /usr/lib/systemd/system/ -name "*.service" \
-newer /etc/os-release -exec echo "=== {} ===" \; -exec grep ExecStart {} \;
# at jobs (one-time scheduled tasks):
atq # List pending at jobs
for job in $(atq | cut -f1); do
echo "=== Job $job ==="
at -c "$job" | tail -5
done
5. Checking for Modified Binaries¶
If key system binaries have been replaced, the system is deeply compromised:
# Quick check: verify package integrity
# RHEL/CentOS:
rpm -Va 2>/dev/null | grep -E "^..5" # Files with changed MD5
# Debian/Ubuntu:
debsums -c 2>/dev/null # List changed files
debsums -s 2>/dev/null # Report errors only
# Check specific critical binaries:
for bin in /usr/bin/ssh /usr/sbin/sshd /usr/bin/sudo /usr/bin/passwd \
/usr/bin/ps /usr/bin/ls /usr/bin/netstat /usr/bin/ss /usr/bin/find; do
if [ -f "$bin" ]; then
echo "=== $bin ==="
file "$bin" # Should be ELF binary, not script
sha256sum "$bin" # Compare against known good hash
stat "$bin" # Check modification time
ls -la "$bin" # Check permissions
fi
done
# Look for SUID/SGID binaries (privilege escalation risk)
find / -type f \( -perm -4000 -o -perm -2000 \) -ls 2>/dev/null
# Look for unexpected SUID binaries:
# Compare against: find / -perm -4000 on a clean system
# New SUID files that weren't there before = highly suspicious
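That comparison can be scripted so nothing is missed. A minimal sketch, assuming you captured a baseline on a known-clean system with the same `find` command; the `/root/suid.baseline` path is hypothetical:

```shell
# Diff the current SUID/SGID inventory against a clean-system baseline.
# /root/suid.baseline is a hypothetical path: one sorted file path per line,
# captured with the same find command on a known-good host.
BASELINE="${BASELINE:-/root/suid.baseline}"

find / -xdev -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null | sort > /tmp/suid.now

if [ -r "$BASELINE" ]; then
    echo "--- SUID/SGID files NOT in baseline (investigate each one):"
    comm -13 "$BASELINE" /tmp/suid.now
    echo "--- Baseline files now MISSING (replaced or removed?):"
    comm -23 "$BASELINE" /tmp/suid.now
else
    echo "No baseline at $BASELINE -- capture one on a clean system first."
fi
```

`comm` requires both inputs sorted, which is why both the baseline and the current snapshot go through `sort`.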
6. Preserving Evidence While Restoring Service¶
The tension: you need to get the service back up, but you also need evidence for the investigation.
Evidence Preservation Priority:
┌─────────────────────────────────────────────────────┐
│ 1. Volatile data (lost on reboot) │
│ ├── Running processes: ps auxf > /evidence/ps │
│ ├── Network connections: ss -tlnp > /evidence/ss│
│ ├── Open files: lsof > /evidence/lsof │
│ ├── Memory: /proc/meminfo, process maps │
│ ├── Routing table: ip route > /evidence/routes │
│ ├── ARP cache: ip neigh > /evidence/arp │
│ ├── Loaded kernel modules: lsmod > /evidence/mod│
│ └── Mount points: mount > /evidence/mounts │
│ │
│ 2. Semi-volatile data (may be rotated) │
│ ├── Log files: tar -czf /evidence/logs.tar.gz │
│ │ /var/log/ │
│ ├── Temp files: tar -czf /evidence/tmp.tar.gz │
│ │ /tmp/ /var/tmp/ │
│ └── User command history: .bash_history files │
│ │
│ 3. Non-volatile data (persistent) │
│ ├── Filesystem timeline (find with timestamps) │
│ ├── Disk image (dd if=/dev/sda of=/evidence/) │
│ ├── Configuration files │
│ └── Application data │
└─────────────────────────────────────────────────────┘
# Quick evidence capture script (run FIRST, before any remediation)
EVIDENCE_DIR="/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$EVIDENCE_DIR"
# Volatile data
date > "$EVIDENCE_DIR/timestamp.txt"
hostname >> "$EVIDENCE_DIR/timestamp.txt"
whoami >> "$EVIDENCE_DIR/timestamp.txt"
ps auxf > "$EVIDENCE_DIR/processes.txt"
ss -tlnp > "$EVIDENCE_DIR/network_listeners.txt"
ss -anp > "$EVIDENCE_DIR/all_connections.txt"
ip addr > "$EVIDENCE_DIR/ip_addresses.txt"
ip route > "$EVIDENCE_DIR/routes.txt"
ip neigh > "$EVIDENCE_DIR/arp_cache.txt"
lsmod > "$EVIDENCE_DIR/kernel_modules.txt"
lsof -nP > "$EVIDENCE_DIR/open_files.txt" 2>/dev/null
mount > "$EVIDENCE_DIR/mounts.txt"
cat /proc/meminfo > "$EVIDENCE_DIR/meminfo.txt"
uptime > "$EVIDENCE_DIR/uptime.txt"
last -50 > "$EVIDENCE_DIR/last_logins.txt"
lastb -50 > "$EVIDENCE_DIR/failed_logins.txt" 2>/dev/null
# Semi-volatile
cp -r /var/log/ "$EVIDENCE_DIR/var_log/" 2>/dev/null
for user_home in /home/* /root; do
user=$(basename "$user_home")
cp "$user_home/.bash_history" "$EVIDENCE_DIR/history_${user}.txt" 2>/dev/null
done
# Hash the evidence
find "$EVIDENCE_DIR" -type f -exec sha256sum {} \; > "$EVIDENCE_DIR/checksums.sha256"
echo "Evidence captured to: $EVIDENCE_DIR"
# Copy this directory OFF the compromised system immediately
# scp -r $EVIDENCE_DIR user@safe-server:/evidence/
7. Timeline Reconstruction from Logs¶
Building a timeline of what happened and when:
# Find files modified in the suspicious time window
find / -mtime -7 -type f -not -path "/proc/*" -not -path "/sys/*" \
-printf "%T+ %p\n" 2>/dev/null | sort > /tmp/timeline.txt
# Find files modified in a specific time range
find / -newermt "2024-03-10 00:00" -not -newermt "2024-03-11 00:00" \
-type f -not -path "/proc/*" -printf "%T+ %p\n" 2>/dev/null | sort
# Correlate events across log files
# Build a unified timeline:
# 1. SSH logins (auth.log/secure)
# 2. sudo usage
# 3. Cron executions
# 4. Service starts/stops
# 5. Package installations
# 6. File modifications
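When these sources all flow through journald, merging them is mostly a sort, because `-o short-iso` timestamps order correctly as plain text. A minimal sketch; the identifier names (sshd, sudo, CRON) are assumptions that vary by distro:

```shell
# Merge several journald sources into one time-ordered stream.
# -o short-iso prints ISO-8601 timestamps, which sort lexicographically.
# The identifier names (sshd, sudo, CRON) vary by distro -- adjust as needed.
{
    journalctl _SYSLOG_IDENTIFIER=sshd --since "7 days ago" -o short-iso --no-pager
    journalctl _SYSLOG_IDENTIFIER=sudo --since "7 days ago" -o short-iso --no-pager
    journalctl _SYSLOG_IDENTIFIER=CRON --since "7 days ago" -o short-iso --no-pager
} 2>/dev/null | sort > /tmp/unified-timeline.txt

wc -l /tmp/unified-timeline.txt
```

File-modification timestamps from `find -printf "%T+ %p\n"` use a different format, so normalize them before mixing them into the same sorted stream.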
# Package installation history:
# RHEL:
rpm -qa --last | head -30
# Debian:
grep " install " /var/log/dpkg.log | tail -30
# or
zgrep " install " /var/log/dpkg.log.* | sort
8. Chain of Custody Basics¶
If the incident may lead to legal action, evidence handling matters:
Chain of Custody Rules:
1. Document who collected the evidence and when
2. Hash all evidence files immediately (SHA-256)
3. Store evidence on a write-once medium or read-only mount
4. Every person who handles the evidence is logged
5. Never modify the original — work on copies
6. Keep a written log of every action taken on the system
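Rules 1, 2, and 6 above can be backed by a tiny helper so the log never lags behind the work. A sketch; the `EVIDENCE_DIR` path and the `log_action` function name are illustrative, not a standard tool:

```shell
# Append every action to a timestamped custody log as you work.
# EVIDENCE_DIR and log_action are illustrative names, not a standard tool.
EVIDENCE_DIR="${EVIDENCE_DIR:-/tmp/evidence}"
mkdir -p "$EVIDENCE_DIR"
CUSTODY_LOG="$EVIDENCE_DIR/custody.log"

log_action() {
    # UTC timestamp | operator | free-text description of the action
    printf '%s | %s | %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(id -un)" "$*" >> "$CUSTODY_LOG"
}

log_action "Began evidence collection"
log_action "Hashed evidence files with sha256sum"

# Hash the log itself before handing off, so later edits are detectable
sha256sum "$CUSTODY_LOG"
```

Hashing the custody log at handoff means the IR team can verify that no entries were added or altered after the fact.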
Evidence log format:
┌─────────────────────────────────────────────────────┐
│ Date/Time │ Who │ Action │
│ 2024-03-10 │ esmith │ Noticed unusual process │
│ 14:32 UTC │ │ via Prometheus alert │
│ 2024-03-10 │ esmith │ Ran evidence capture │
│ 14:35 UTC │ │ script, copied to safe │
│ 2024-03-10 │ esmith │ Notified IR team via │
│ 14:40 UTC │ │ #security-incidents │
│ 2024-03-10 │ jdoe │ IR team took over │
│ 15:00 UTC │ │ investigation │
└─────────────────────────────────────────────────────┘
9. Working with the Security Team¶
When the IR team arrives (or when you're on the phone with them):
What they'll ask you:
├── When did you first notice the anomaly?
├── What alerted you? (monitoring, user report, gut feeling)
├── What did you touch on the system? (commands you ran)
├── Did you reboot or restart anything?
├── Is the system still running in its current state?
├── What evidence have you collected?
├── What's the business impact? (service down, data exposed?)
└── Who else knows about this?
What you should have ready:
├── The evidence directory (hashed, copied off-system)
├── Your timeline of actions (what you did and when)
├── System context (what role this server plays, what data it has)
├── Network diagram showing the server's connectivity
├── List of services running on the server
└── Whether the server has access to other sensitive systems
10. DoD/Military Security Awareness Applied to Infrastructure¶
Military-grade security thinking translates directly to infrastructure:
Military Concept → Infrastructure Application
─────────────────────────────────────────────────────
Need to know → Least-privilege access
Defense in depth → Multiple security layers
Operational security → Don't discuss incidents publicly
After-action review → Postmortem / incident review
Watch standing → On-call discipline
Classified handling → Secrets management (Vault, KMS)
COMSEC → TLS everywhere, encrypted at rest
Physical security → Server room access controls
Security clearance levels → RBAC tiers (admin, operator, viewer)
Threat briefing → Security advisory monitoring (CVEs)
Common Pitfalls¶
- Rebooting the server to "fix" the problem. You just destroyed all volatile evidence — running processes, network connections, memory contents. Capture evidence first, then contain, then remediate.
- Running rm on suspicious files. You need those files for analysis. Quarantine them (move to an evidence directory) or leave them in place. The IR team will want to examine them.
- Assuming you weren't compromised because the rootkit scanner found nothing. Sophisticated rootkits can hide from scanners running on the compromised system. If you have strong indicators of compromise, trust the indicators over the scanner results.
Under the hood: AIDE (Advanced Intrusion Detection Environment) works by building a database of file hashes, permissions, and timestamps during aide --init. On subsequent aide --check runs, it compares the current state against the baseline and reports differences. The baseline must be stored somewhere the attacker cannot modify -- ideally off-system (read-only NFS, S3). If the baseline is on the compromised host, the attacker can modify it to hide their changes.
- Not having an AIDE/Tripwire baseline. File integrity monitoring is useless without a known-good baseline to compare against. Set it up before you need it, not during an incident.
- Investigating alone. You found something suspicious. You're competent. You start investigating on your own without telling anyone. Hours later, you've accidentally tipped off the attacker by running commands they can observe through their own access. Always notify the security team immediately.
- Forgetting that logs can be tampered with. A skilled attacker will modify or delete local logs. Centralized logging (shipping logs off-server in real time) is your safety net. If your logging pipeline sends to a remote SIEM, the attacker can't retroactively modify those logs.
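Centralized logging has to be configured before the incident. A minimal rsyslog forwarding sketch, assuming a collector reachable at the hypothetical hostname siem.example.internal on 514/tcp; the disk queue keeps logs flowing if the collector is briefly down:

```
# /etc/rsyslog.d/90-forward.conf (conventional drop-in path on Debian/RHEL)
# Forward everything to a remote collector over TCP; spool to disk if it is down.
*.* action(type="omfwd"
           target="siem.example.internal" port="514" protocol="tcp"
           queue.type="LinkedList" queue.filename="fwd_queue"
           queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
           action.resumeRetryCount="-1")
```

Plain TCP syslog is unencrypted; for logs crossing untrusted networks, wrap the transport in TLS.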
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
- Security Basics (Ops-Focused) (Topic Pack, L1)
Related Content¶
- Compliance & Audit Automation (Topic Pack, L2) — Audit Logging, Linux Hardening
- SELinux & Linux Hardening (Topic Pack, L2) — Audit Logging, Linux Hardening
- Audit Logging (Topic Pack, L1) — Audit Logging
- Audit Logging Flashcards (CLI) (flashcard_deck, L1) — Audit Logging
- Deep Dive: Systemd Service Design Debugging and Hardening (deep_dive, L2) — Linux Hardening
- LDAP & Identity Management (Topic Pack, L2) — Linux Hardening
- Linux Security Flashcards (CLI) (flashcard_deck, L1) — Linux Hardening
- Linux Users & Permissions (Topic Pack, L1) — Linux Hardening
- Runbook: Unauthorized Access Investigation (Runbook, L2) — Audit Logging
- SELinux & AppArmor (Topic Pack, L2) — Linux Hardening
Pages that link here¶
- Anti-Primer: Infra Forensics
- Audit Logging
- Compliance & Audit Automation
- Index
- Infrastructure Forensics
- LDAP & Identity Management
- Linux Users and Permissions
- Master Curriculum: 40 Weeks
- Runbook: Unauthorized Access Investigation
- SELinux & AppArmor
- SELinux & Linux Hardening
- Security Basics (Ops-Focused)
- systemd Service Design, Debugging, and Hardening