
Portal | Level: L2: Operations | Topics: Infrastructure Forensics, Audit Logging, Linux Hardening | Domain: Security

Infrastructure Forensics - Primer

Why This Matters

The ops engineer is almost always first on scene when something looks wrong. A server is behaving strangely. Network traffic is going places it shouldn't. An alert fires for a process that shouldn't exist. Before the security team arrives — if there even is a security team — you need to know what to do and what not to do. Infrastructure forensics for ops people isn't about being a forensic analyst. It's about preserving evidence, making the right initial assessment, and not accidentally destroying the proof of what happened.

Analogy: Think of a compromised server like a crime scene. The first officer on scene does not dust for fingerprints -- they secure the perimeter and preserve evidence. Your job as the first responder is the same: contain, preserve, document, hand off. Doing too much is as dangerous as doing too little.

In the military, you learn that the person who discovers a security incident has a responsibility to secure the scene before specialists arrive. The same applies in infrastructure. Your actions in the first 30 minutes determine whether the incident response team has evidence to work with or a wiped-clean system with no trail.

Core Concepts

1. The Ops Engineer's Role in Incident Response

You're not the investigator. You're the first responder. Your job is to:

┌─────────────────────────────────────────────────────┐
│  First Responder Responsibilities                    │
│                                                      │
│  DO:                                                 │
│  ├── Detect and confirm the anomaly                 │
│  ├── Preserve volatile evidence (RAM, connections)  │
│  ├── Document what you see (screenshots, logs)      │
│  ├── Contain the damage (isolate, don't destroy)    │
│  ├── Notify the security/IR team                    │
│  └── Maintain chain of custody                      │
│                                                      │
│  DO NOT:                                             │
│  ├── Reboot the system (destroys volatile data)     │
│  ├── Kill suspicious processes (destroys evidence)  │
│  ├── Delete suspicious files (you'll need them)     │
│  ├── Run unknown cleanup scripts                    │
│  ├── Log in with shared credentials                 │
│  └── Discuss the incident on public channels        │
└─────────────────────────────────────────────────────┘
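The "document what you see" duty is easiest to satisfy if every action gets a timestamped note as you go. A minimal sketch — the `note` helper and the log path `/tmp/ir-actions.log` are illustrative, not a standard tool; in practice the log should go to off-system storage:

```shell
# Minimal first-responder action log: every note gets a UTC timestamp and
# your username, so "what did you touch, and when?" has a ready answer.
LOG="${IR_LOG:-/tmp/ir-actions.log}"

note() {
  # $* = free-text description of the action you just took
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(whoami)" "$*" >> "$LOG"
}

note "Confirmed unexpected listener on port 4444 via ss"
note "Captured volatile evidence with capture script"
```

Run it in the same shell you are investigating from; the resulting log doubles as the start of your chain-of-custody record.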

2. Checking for Rootkits

A rootkit modifies the operating system to hide the attacker's presence. Compromised systems lie to you — ps might not show the malicious process, ls might not show the backdoor file.

# rkhunter (Rootkit Hunter)
apt install rkhunter    # or dnf install rkhunter
rkhunter --update       # Update signatures
rkhunter --check        # Full system scan

# What rkhunter checks:
# - Known rootkit signatures
# - File property changes (permissions, ownership, size)
# - Hidden files and directories
# - Suspicious kernel modules
# - Network interfaces in promiscuous mode
# - Suspicious startup files

# chkrootkit
apt install chkrootkit
chkrootkit              # Quick scan

# AIDE (Advanced Intrusion Detection Environment)
# File integrity monitoring — detects unauthorized changes
apt install aide
aide --init                              # Create baseline database
cp /var/lib/aide/aide.db.new /var/lib/aide/aide.db  # Set baseline
aide --check                             # Compare current state to baseline

# AIDE reports:
# Added files (new files that shouldn't be there)
# Removed files (files that were deleted)
# Changed files (modified binaries, configs, permissions)

Remember: Mnemonic for the forensic first-response order: "Detect, Preserve, Contain, Notify, Document" -- DPCND. Skip any step and the investigation suffers. The most common mistake is jumping straight to "contain" (killing processes, rebooting) and destroying volatile evidence.

Important: If the system is already compromised, local tools may be compromised too. If possible, run these from a known-good live USB or compare against known-good binaries:

# Compare binary against package manager's known-good version
rpm -V openssh-server     # RHEL: verify package file integrity
dpkg --verify openssh-server  # Debian: verify package files
debsums openssh-server    # Debian: checksum verification

# If a system binary was modified:
# Flag legend: SM5DLUGTP (Size, Mode, digest/MD5, Device, Link, User, Group, mTime, caPabilities)
rpm -Vf /usr/sbin/sshd
# Output like "S.5....T." means size, MD5 digest, and mtime changed = COMPROMISED

3. Auditing SSH Access

SSH is the most common entry point for unauthorized access:

# Who has logged in recently?
last -20                              # Last 20 successful logins
lastb -20                             # Last 20 failed login attempts
w                                     # Currently logged-in users

# Authentication log analysis
# Debian/Ubuntu:
grep "Accepted" /var/log/auth.log | tail -30
grep "Failed" /var/log/auth.log | tail -30

# RHEL/CentOS:
grep "Accepted" /var/log/secure | tail -30
grep "Failed" /var/log/secure | tail -30

# systemd journal:
journalctl _SYSLOG_IDENTIFIER=sshd --since "7 days ago" | grep "Accepted"
journalctl _SYSLOG_IDENTIFIER=sshd --since "7 days ago" | grep "Failed"

# Check for unusual SSH patterns:
# 1. Logins from unexpected IP addresses
# 2. Logins at unusual hours
# 3. Logins as root (should be disabled)
# 4. Logins using password (should be key-only)
# 5. Successful login after many failures (brute force succeeded)

# Audit all authorized_keys files
find / -name "authorized_keys" -exec echo "=== {} ===" \; -exec cat {} \; 2>/dev/null

# Look for SSH keys with no comments (who added them?)
# Look for keys you don't recognize
# Compare against your key management inventory
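Pattern 5 above — a success after repeated failures from the same source — can be flagged mechanically. A sketch using awk over auth.log-style lines; the sample log below is fabricated for illustration, and the threshold of 3 failures is an assumption you would tune:

```shell
# Build a fabricated auth.log sample (real input: /var/log/auth.log or secure)
cat > /tmp/auth.sample <<'EOF'
Mar 10 14:01:01 web1 sshd[101]: Failed password for root from 203.0.113.9 port 4411 ssh2
Mar 10 14:01:03 web1 sshd[102]: Failed password for root from 203.0.113.9 port 4412 ssh2
Mar 10 14:01:05 web1 sshd[103]: Failed password for root from 203.0.113.9 port 4413 ssh2
Mar 10 14:01:07 web1 sshd[104]: Accepted password for root from 203.0.113.9 port 4414 ssh2
Mar 10 14:30:00 web1 sshd[105]: Accepted publickey for deploy from 198.51.100.7 port 5022 ssh2
EOF

# Count failures per source IP; flag any success from an IP with >= 3 failures
awk '
  /Failed password/ { for (i=1;i<NF;i++) if ($i=="from") fails[$(i+1)]++ }
  /Accepted/ {
    for (i=1;i<NF;i++) if ($i=="from") ip=$(i+1)
    if (fails[ip] >= 3)
      printf "SUSPICIOUS: success from %s after %d failures\n", ip, fails[ip]
  }
' /tmp/auth.sample > /tmp/ssh-flags.txt
cat /tmp/ssh-flags.txt
# → SUSPICIOUS: success from 203.0.113.9 after 3 failures
```

The legitimate key-based login from 198.51.100.7 is not flagged, because it has no preceding failures.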

4. Finding Unauthorized Cron Jobs and Services

Attackers establish persistence through scheduled tasks and services:

# Check ALL cron locations
# Per-user crontabs:
for user in $(cut -d: -f1 /etc/passwd); do
  entries=$(crontab -l -u "$user" 2>/dev/null) && { echo "=== $user ==="; echo "$entries"; }
done

# System cron directories:
ls -la /etc/cron.d/
ls -la /etc/cron.daily/
ls -la /etc/cron.hourly/
ls -la /etc/cron.weekly/
ls -la /etc/cron.monthly/
cat /etc/crontab

# Systemd timers (modern cron replacement):
systemctl list-timers --all

# Check for unauthorized systemd services:
systemctl list-units --type=service --state=running
# Look for services you don't recognize
# Compare against a known-good baseline

# Check for services enabled at boot:
systemctl list-unit-files --state=enabled

# Check systemd service files for suspicious ExecStart:
find /etc/systemd/system/ /usr/lib/systemd/system/ -name "*.service" \
  -newer /etc/os-release -exec echo "=== {} ===" \; -exec grep ExecStart {} \;

# at jobs (one-time scheduled tasks):
atq                                   # List pending at jobs
for job in $(atq | cut -f1); do
  echo "=== Job $job ==="
  at -c "$job" | tail -5
done

5. Checking for Modified Binaries

If key system binaries have been replaced, the system is deeply compromised:

# Quick check: verify package integrity
# RHEL/CentOS:
rpm -Va 2>/dev/null | grep -E "^..5"  # Files with changed MD5

# Debian/Ubuntu:
debsums -c 2>/dev/null                 # List files whose checksums differ
debsums -s 2>/dev/null                 # Silent mode: report errors only

# Check specific critical binaries:
for bin in /usr/bin/ssh /usr/sbin/sshd /usr/bin/sudo /usr/bin/passwd \
  /usr/bin/ps /usr/bin/ls /usr/bin/netstat /usr/bin/ss /usr/bin/find; do
  if [ -f "$bin" ]; then
    echo "=== $bin ==="
    file "$bin"                        # Should be ELF binary, not script
    sha256sum "$bin"                   # Compare against known good hash
    stat "$bin"                        # Check modification time
    ls -la "$bin"                      # Check permissions
  fi
done

# Look for SUID/SGID binaries (privilege escalation risk)
find / -type f \( -perm -4000 -o -perm -2000 \) -ls 2>/dev/null

# Look for unexpected SUID binaries:
# Compare against: find / -perm -4000 on a clean system
# New SUID files that weren't there before = highly suspicious
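The baseline comparison described above can be done with comm on two sorted snapshots. A sketch using fabricated file lists — in practice, generate the baseline on a known-good system and the current list on the suspect host with the same find command:

```shell
# Fabricated snapshots standing in for `find / -perm -4000` output
printf '%s\n' /usr/bin/passwd /usr/bin/sudo | sort > /tmp/suid-baseline.txt
printf '%s\n' /usr/bin/passwd /usr/bin/sudo /tmp/.hidden/backdoor | sort > /tmp/suid-current.txt

# comm -13: suppress lines unique to file 1 and lines common to both,
# leaving only SUID binaries that appeared since the baseline
comm -13 /tmp/suid-baseline.txt /tmp/suid-current.txt > /tmp/new-suid.txt
cat /tmp/new-suid.txt
# → /tmp/.hidden/backdoor
```

Both inputs must be sorted the same way, or comm's output is meaningless.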

6. Preserving Evidence While Restoring Service

The tension: you need to get the service back up, but you also need evidence for the investigation.

Evidence Preservation Priority:
┌─────────────────────────────────────────────────────┐
│  1. Volatile data (lost on reboot)                  │
│     ├── Running processes: ps auxf > /evidence/ps   │
│     ├── Network connections: ss -tlnp > /evidence/ss│
│     ├── Open files: lsof > /evidence/lsof           │
│     ├── Memory: /proc/meminfo, process maps         │
│     ├── Routing table: ip route > /evidence/routes  │
│     ├── ARP cache: ip neigh > /evidence/arp         │
│     ├── Loaded kernel modules: lsmod > /evidence/mod│
│     └── Mount points: mount > /evidence/mounts      │
│                                                     │
│  2. Semi-volatile data (may be rotated)             │
│     ├── Log files: /var/log/ (tar.gz archive)       │
│     ├── Temp files: /tmp/ /var/tmp/                 │
│     └── User command history: .bash_history files   │
│                                                     │
│  3. Non-volatile data (persistent)                  │
│     ├── Filesystem timeline (find with timestamps)  │
│     ├── Disk image (dd if=/dev/sda of=/evidence/)   │
│     ├── Configuration files                         │
│     └── Application data                            │
└─────────────────────────────────────────────────────┘

# Quick evidence capture script (run FIRST, before any remediation)
EVIDENCE_DIR="/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$EVIDENCE_DIR"

# Volatile data
date > "$EVIDENCE_DIR/timestamp.txt"
hostname >> "$EVIDENCE_DIR/timestamp.txt"
whoami >> "$EVIDENCE_DIR/timestamp.txt"
ps auxf > "$EVIDENCE_DIR/processes.txt"
ss -tlnp > "$EVIDENCE_DIR/network_listeners.txt"
ss -anp > "$EVIDENCE_DIR/all_connections.txt"
ip addr > "$EVIDENCE_DIR/ip_addresses.txt"
ip route > "$EVIDENCE_DIR/routes.txt"
ip neigh > "$EVIDENCE_DIR/arp_cache.txt"
lsmod > "$EVIDENCE_DIR/kernel_modules.txt"
lsof -nP > "$EVIDENCE_DIR/open_files.txt" 2>/dev/null
mount > "$EVIDENCE_DIR/mounts.txt"
cat /proc/meminfo > "$EVIDENCE_DIR/meminfo.txt"
uptime > "$EVIDENCE_DIR/uptime.txt"
last -50 > "$EVIDENCE_DIR/last_logins.txt"
lastb -50 > "$EVIDENCE_DIR/failed_logins.txt" 2>/dev/null

# Semi-volatile
cp -r /var/log/ "$EVIDENCE_DIR/var_log/" 2>/dev/null
for user_home in /home/* /root; do
  user=$(basename "$user_home")
  cp "$user_home/.bash_history" "$EVIDENCE_DIR/history_${user}.txt" 2>/dev/null
done

# Hash the evidence
find "$EVIDENCE_DIR" -type f -exec sha256sum {} \; > "$EVIDENCE_DIR/checksums.sha256"

echo "Evidence captured to: $EVIDENCE_DIR"
# Copy this directory OFF the compromised system immediately
# scp -r $EVIDENCE_DIR user@safe-server:/evidence/

7. Timeline Reconstruction from Logs

Building a timeline of what happened and when:

# Find files modified in the suspicious time window
find / -mtime -7 -type f -not -path "/proc/*" -not -path "/sys/*" \
  -printf "%T+ %p\n" 2>/dev/null | sort > /tmp/timeline.txt

# Find files modified in a specific time range
find / -newermt "2024-03-10 00:00" -not -newermt "2024-03-11 00:00" \
  -type f -not -path "/proc/*" -printf "%T+ %p\n" 2>/dev/null | sort

# Correlate events across log files
# Build a unified timeline:
# 1. SSH logins (auth.log/secure)
# 2. sudo usage
# 3. Cron executions
# 4. Service starts/stops
# 5. Package installations
# 6. File modifications

# Package installation history:
# RHEL:
rpm -qa --last | head -30

# Debian:
grep " install " /var/log/dpkg.log | tail -30
# or
zgrep " install " /var/log/dpkg.log.* | sort

8. Chain of Custody Basics

If the incident may lead to legal action, evidence handling matters:

Chain of Custody Rules:
1. Document who collected the evidence and when
2. Hash all evidence files immediately (SHA-256)
3. Store evidence on a write-once medium or read-only mount
4. Every person who handles the evidence is logged
5. Never modify the original -- work on copies
6. Keep a written log of every action taken on the system

Evidence log format:
Date/Time             Who     Action
─────────────────────────────────────────────────────
2024-03-10 14:32 UTC  esmith  Noticed unusual process via Prometheus alert
2024-03-10 14:35 UTC  esmith  Ran evidence capture script, copied to safe-server
2024-03-10 14:40 UTC  esmith  Notified IR team via #security-incidents
2024-03-10 15:00 UTC  jdoe    IR team took over investigation
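Rule 2 in practice: hash evidence at collection time so that anyone later in the chain can prove the files are unmodified. A sketch with an illustrative evidence directory:

```shell
# Collector: hash each evidence file immediately after capture
mkdir -p /tmp/evidence-demo
echo "suspicious cron entry" > /tmp/evidence-demo/cron.txt
( cd /tmp/evidence-demo && sha256sum cron.txt > checksums.sha256 )

# Later handler: verify integrity before touching anything
( cd /tmp/evidence-demo && sha256sum -c checksums.sha256 )
# → cron.txt: OK
```

If a file had been altered, `sha256sum -c` would report `FAILED` and exit nonzero — which is exactly the signal that breaks (or preserves) the chain of custody.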

9. Working with the Security Team

When the IR team arrives (or when you're on the phone with them):

What they'll ask you:
├── When did you first notice the anomaly?
├── What alerted you? (monitoring, user report, gut feeling)
├── What did you touch on the system? (commands you ran)
├── Did you reboot or restart anything?
├── Is the system still running in its current state?
├── What evidence have you collected?
├── What's the business impact? (service down, data exposed?)
└── Who else knows about this?

What you should have ready:
├── The evidence directory (hashed, copied off-system)
├── Your timeline of actions (what you did and when)
├── System context (what role this server plays, what data it has)
├── Network diagram showing the server's connectivity
├── List of services running on the server
└── Whether the server has access to other sensitive systems

10. DoD/Military Security Awareness Applied to Infrastructure

Military-grade security thinking translates directly to infrastructure:

Military Concept          → Infrastructure Application
─────────────────────────────────────────────────────
Need to know              → Least-privilege access
Defense in depth          → Multiple security layers
Operational security      → Don't discuss incidents publicly
After-action review       → Postmortem / incident review
Watch standing            → On-call discipline
Classified handling       → Secrets management (Vault, KMS)
COMSEC                    → TLS everywhere, encrypted at rest
Physical security         → Server room access controls
Security clearance levels → RBAC tiers (admin, operator, viewer)
Threat briefing           → Security advisory monitoring (CVEs)

Common Pitfalls

  • Rebooting the server to "fix" the problem. You just destroyed all volatile evidence — running processes, network connections, memory contents. Capture evidence first, then contain, then remediate.
  • Running rm on suspicious files. You need those files for analysis. Quarantine them (move to an evidence directory) or leave them in place. The IR team will want to examine them.
  • Assuming you weren't compromised because the rootkit scanner found nothing. Sophisticated rootkits can hide from scanners running on the compromised system. If you have strong indicators of compromise, trust the indicators over the scanner results.

    Under the hood: AIDE (Advanced Intrusion Detection Environment) works by building a database of file hashes, permissions, and timestamps during aide --init. On subsequent aide --check runs, it compares the current state against the baseline and reports differences. The baseline must be stored somewhere the attacker cannot modify -- ideally off-system (read-only NFS, S3). If the baseline is on the compromised host, the attacker can modify it to hide their changes.

  • Not having an AIDE/Tripwire baseline. File integrity monitoring is useless without a known-good baseline to compare against. Set it up before you need it, not during an incident.

  • Investigating alone. You found something suspicious. You're competent. You start investigating on your own without telling anyone. Hours later, you've accidentally tipped off the attacker, who can watch every command you run through their access. Always notify the security team immediately.
  • Forgetting that logs can be tampered with. A skilled attacker will modify or delete local logs. Centralized logging (shipping logs off-server in real time) is your safety net. If your logging pipeline sends to a remote SIEM, the attacker can't retroactively modify those logs.
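The "under the hood" note about AIDE stresses keeping the baseline where the attacker can't touch it. One way to enforce that is to record the baseline's own hash off-system and verify it before trusting any check. A sketch with stand-in file paths — a real setup would hash /var/lib/aide/aide.db and store the hash on another host:

```shell
verify_baseline() {
  # $1 = baseline file, $2 = file holding its known-good SHA-256 (kept off-system)
  [ "$(sha256sum "$1" | awk '{print $1}')" = "$(cat "$2")" ]
}

# Demo with stand-in files for the AIDE database and its off-system hash
echo "pretend aide database" > /tmp/aide-demo.db
sha256sum /tmp/aide-demo.db | awk '{print $1}' > /tmp/aide-demo.sha256

verify_baseline /tmp/aide-demo.db /tmp/aide-demo.sha256 \
  && echo "baseline intact -- safe to run aide --check" > /tmp/aide-demo.log

echo "attacker edit" >> /tmp/aide-demo.db    # simulate tampering
verify_baseline /tmp/aide-demo.db /tmp/aide-demo.sha256 \
  || echo "BASELINE TAMPERED -- do not trust aide --check" >> /tmp/aide-demo.log
```

An untrusted baseline makes every AIDE report untrustworthy, so this check belongs before, not after, `aide --check`.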

Wiki Navigation

Prerequisites