# Linux — The Complete Guide: From Power Button to Production Mastery

Topics: Boot process (BIOS/UEFI, GRUB, kernel, initramfs, systemd), processes & signals, filesystems & storage (ext4, XFS, LVM, RAID), memory management (virtual memory, page cache, swap, OOM killer), networking (TCP/IP, iptables, bridges, bonds, VLANs), SSH, permissions & security (SELinux, AppArmor, hardening), debugging (strace, /proc, eBPF, performance), systemd (unit files, targets, timers, journald), package management, cgroups & namespaces, kernel tuning, text processing, logging

Strategy: Build-up from bare metal to production operations, with war stories, trivia, and drills throughout

Level: L0–L2 (Zero → Foundations → Operations)

Time: 5–6 hours (designed for deep study in one or multiple sittings)

Prerequisites: Access to a Linux terminal. No prior Linux experience required — everything is explained from scratch.


The Mission

A rack-mount server sits in a datacenter. You press the power button. Forty-five seconds later, you SSH in, check disk space, restart a service, and deploy an application. In those 45 seconds, the machine went from no electricity to a running Linux system with a filesystem, a network stack, 200 services, and a login prompt. It executed firmware from the 1970s, a bootloader that's a small operating system, a kernel that unpacked itself from a compressed archive, a temporary filesystem that exists only in RAM, and a process manager that started everything in parallel.

By the end of this guide you'll understand every layer of that stack — from the moment electricity hits the motherboard to the moment you debug a production issue at 3 AM. This is the one document you need to go from "I type commands in a terminal" to "I understand what Linux is actually doing."


Table of Contents

  1. The Boot Sequence — Power to Login
  2. The Kernel — What Linux Actually Is
  3. systemd — The Process Manager
  4. Processes, Signals, and Process Control
  5. Users, Permissions, and Ownership
  6. The Filesystem — Everything Is a File
  7. Storage — Disks, Partitions, LVM, RAID
  8. Memory Management — Virtual Memory to OOM Killer
  9. Networking Fundamentals — TCP/IP, DNS, Routing
  10. iptables and Firewalls — Following a Packet
  11. SSH — The Protocol That Runs Infrastructure
  12. The /proc Filesystem — Linux's Hidden API
  13. Debugging with strace — Reading System Calls
  14. Performance Triage — The USE Method
  15. Logging — journald, syslog, and Log Management
  16. Package Management — apt, dnf, and Friends
  17. Text Processing — grep, awk, sed, and the Pipeline
  18. cgroups and Namespaces — Container Foundations
  19. Security Hardening — Closing the Doors
  20. eBPF — The Linux Superpower
  21. Linux Distributions — Choosing and Understanding
  22. On-Call Survival Guide
  23. Real-World Case Studies
  24. Glossary
  25. Trivia and History
  26. Flashcard Review
  27. Drills
  28. Cheat Sheet
  29. Self-Assessment

Part 1: The Boot Sequence

You press the power button. Here's everything that happens.

Stage 1: Firmware (BIOS/UEFI)

The power supply stabilizes voltage and sends a "Power Good" signal (~100-500ms). The CPU begins executing from a hardwired address — the reset vector (0xFFFFFFF0 on x86). At this moment: RAM isn't initialized, no storage exists, no operating system.

BIOS (the old way, 1981–2020):

Power on → POST (Power-On Self-Test) → Read first 512 bytes (MBR) → Jump to bootloader

The MBR is exactly 512 bytes: 440 bytes of code, 64 bytes of partition table (max 4 partitions, max 2TB disk), and the 0x55AA boot signature.

Trivia: The 0x55AA signature has been the same since the original IBM PC in 1981. Its bit pattern (01010101 10101010) alternates between 0 and 1, making it unlikely to occur randomly.
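The signature is easy to poke at with a scratch image file. A minimal sketch (never write to a real disk device; octal escapes \125 and \252 are 0x55 and 0xAA):

```shell
# Build a fake 512-byte "MBR" and stamp the boot signature at bytes 510-511
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=1 2>/dev/null
printf '\125\252' | dd of="$img" bs=1 seek=510 conv=notrunc 2>/dev/null

# On disk the bytes are 0x55 then 0xAA (the little-endian word 0xAA55)
od -A d -t x1 -j 510 "$img"
# → 0000510 55 aa
rm -f "$img"
```

This is the same two-byte check BIOS firmware performs before jumping to the bootloader code in the first sector.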

UEFI (the modern way, 2005+):

Power on → POST → Read NVRAM boot entries → Load EFI application from ESP → EFI app is bootloader

| Feature | BIOS | UEFI |
|---|---|---|
| Partition table | MBR (2TB max, 4 partitions) | GPT (9.4 ZB max, 128 partitions) |
| Bootloader size | 440 bytes in MBR | Full binary on ESP (FAT32 partition) |
| Secure Boot | No | Yes — cryptographic chain of trust |
| Environment | 16-bit, 1MB address space | 32/64-bit, GiB of address space |

Secure Boot chain: UEFI firmware → shimx64.efi (signed by Microsoft) → grubx64.efi (signed by distro) → vmlinuz (signed by distro). If any signature fails, boot halts.

Stage 2: GRUB — Loading the Kernel

GRUB2 (GRand Unified Bootloader) is a small operating system with filesystem drivers, a shell, and a scripting language.

# See current kernel command line
cat /proc/cmdline
# → BOOT_IMAGE=/vmlinuz-6.5.0-44-generic root=UUID=abc123... ro quiet splash

# See boot timing
systemd-analyze
# → Startup finished in 2.5s (firmware) + 3.1s (loader) + 1.8s (kernel) + 8.4s (userspace)

Key kernel command line parameters:

| Parameter | Purpose |
|---|---|
| root=UUID=... | Where to find the root filesystem |
| ro | Mount root read-only initially (for fsck) |
| single or 1 | Boot to single-user (rescue) mode |
| init=/bin/bash | Skip init entirely, drop to shell (emergency) |
| rd.break | Break into initramfs shell before switch_root |
| console=ttyS0,115200 | Serial console for headless servers |

Gotcha: Never edit /boot/grub/grub.cfg directly — it's regenerated by update-grub. Edit /etc/default/grub instead.

Stage 3: Kernel Initialization

The kernel image (vmlinuz — the "z" means compressed) decompresses itself, then:

1. Detects CPU features and security mitigations
2. Builds the memory map and page tables
3. Configures interrupts
4. Enumerates PCI/PCIe devices (NICs, storage controllers, GPUs)
5. Initializes built-in drivers

# See kernel boot messages
dmesg | head -50
# → [0.000000] Linux version 6.5.0-44-generic ...
# → [0.123456] PCI: Using host bridge windows ...
# → [0.345678] nvme nvme0: pci function 0000:01:00.0

Stage 4: Initramfs — The Bridge to Root

The kernel needs to mount the root filesystem, but the root might be on LVM, LUKS encryption, software RAID, or an NVMe drive whose driver isn't compiled in. The initramfs (initial RAM filesystem) is a compressed CPIO archive containing just enough tools and drivers to find and mount the real root.

Initramfs (in RAM)          Real Root (on disk)
├── /init                   ├── /sbin/init → systemd
├── /bin/busybox            ├── /etc/
├── /lib/modules/           ├── /var/
└── /scripts/               └── /home/
        ↓ switch_root ↓

When initramfs can't find root:

ALERT! UUID=abc123... does not exist. Dropping to a shell!
(initramfs) _

Gotcha: If you change storage controllers (SATA→NVMe, new RAID card), rebuild initramfs before rebooting: update-initramfs -u (Debian) or dracut --force (RHEL). Otherwise: kernel panic.

Stage 5: PID 1 — systemd Takes Over

The kernel executes /sbin/init (symlink to systemd). PID 1 is special:

- Can't be killed — kernel drops unhandled signals
- If it exits, kernel panics — system is dead
- Reaps orphans — cleans up processes whose parents died

# What took longest to boot?
systemd-analyze blame | head -10

# Critical path (bottleneck chain)
systemd-analyze critical-chain

Part 2: The Kernel

What Linux Actually Is

Linux is a kernel — the core of the operating system that controls CPU, memory, devices, and provides system calls. Everything else (bash, systemd, grep, nginx) is userspace software that talks to the kernel via syscalls.

[Your commands  ]  bash/zsh runs tools, pipes streams
[Userspace tools]  ps, ss, journalctl, find, ip, grep
[Libraries      ]  glibc, NSS, SSL, PAM
[Syscalls       ]  open/read/write/fork/exec/socket
[Kernel         ]  scheduler, VFS, network stack, drivers
[Hardware       ]  CPU, RAM, disk, NIC

Key Kernel Concepts

System calls (syscalls): The only way userspace can interact with hardware. Every file open, network connection, and process creation goes through a syscall.

Kernel modules: Drivers and features that can be loaded/unloaded without rebooting:

lsmod                           # List loaded modules
modprobe nvidia                 # Load a module
modinfo ext4                    # Module information

Kernel ring buffer: Hardware detection and driver messages from the moment the kernel starts:

dmesg -T | tail -50             # Recent kernel messages with timestamps
dmesg | grep -i error           # Find kernel errors

Kernel parameters (sysctl): Runtime-tunable kernel behavior:

sysctl -a | wc -l               # Hundreds of tunable parameters
sysctl net.ipv4.ip_forward      # Check IP forwarding
sysctl -w net.ipv4.ip_forward=1 # Enable (temporary)
# Persistent: add to /etc/sysctl.d/99-custom.conf


Part 3: systemd

systemd is the init system and service manager on virtually all modern Linux distributions. It replaced SysV init's sequential shell scripts with parallel, dependency-based service management.

Essential Commands

# Service management
systemctl status nginx          # Status + recent logs
systemctl start/stop/restart nginx
systemctl enable/disable nginx  # Boot persistence
systemctl enable --now nginx    # Enable AND start

# Finding problems
systemctl list-units --failed   # Failed services
systemctl list-units --type=service --state=running

# After editing unit files
systemctl daemon-reload

Unit Files

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target postgresql.service
Wants=postgresql.service

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/server --port 8080
Restart=on-failure
RestartSec=5
Environment=NODE_ENV=production

# Resource limits
MemoryMax=512M
CPUQuota=200%

# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/lib/myapp /var/log/myapp

[Install]
WantedBy=multi-user.target
Directive Meaning
After= Start after these units (ordering)
Wants= Soft dependency — start these, but don't fail if they fail
Requires= Hard dependency — fail if these fail
Type=simple Process stays in foreground (most common)
Type=forking Process forks and parent exits (legacy daemons)
Restart=on-failure Restart if exit code is non-zero
RestartSec=5 Wait 5 seconds between restarts
WantedBy=multi-user.target Enable means "start at boot"

Warning: network.target does NOT mean "the network is ready." It means the networking stack startup has been initiated. Use network-online.target only for services that truly need configured connectivity before starting. Most server daemons do NOT need it — they bind a socket and accept connections whenever they arrive.

Drop-in Overrides

Customize a unit without editing the original file:

# Create an override
systemctl edit nginx
# Creates /etc/systemd/system/nginx.service.d/override.conf

# Or manually:
mkdir -p /etc/systemd/system/nginx.service.d/
cat > /etc/systemd/system/nginx.service.d/override.conf << 'EOF'
[Service]
MemoryMax=1G
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl restart nginx

Timers (Cron Replacement)

# /etc/systemd/system/backup.timer
[Unit]
Description=Run backup daily

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

systemctl enable --now backup.timer
systemctl list-timers                  # See all timers

journald — Structured Logging

journalctl -u nginx -f                 # Follow service logs
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx --since "2026-03-23 00:00" --until "2026-03-23 06:00"
journalctl -p err -b                   # Errors since boot
journalctl -k                          # Kernel messages only
journalctl --disk-usage                # Log storage used
journalctl --vacuum-size=500M          # Trim logs to 500MB
journalctl -o json-pretty -u nginx -n 1  # JSON output

Why systemd is controversial: It replaced a system (SysV init, 1983) that used simple shell scripts anyone could read. systemd is a complex binary that manages services, logging, networking, timers, hostname, locale, and more. Critics say it violates Unix philosophy ("do one thing well"). Supporters say it solved real problems: parallel boot, dependency management, process supervision, and resource isolation. The Debian vote in 2014 nearly split the project, and Devuan was forked specifically to maintain a systemd-free Debian.


Part 4: Processes and Signals

Process Lifecycle

fork() → new process (copy of parent)
exec() → replace process image with new program
wait() → parent collects child's exit status
exit() → process terminates

Every process has:

- PID — unique process ID
- PPID — parent process ID
- UID/GID — owner
- File descriptors — open files, sockets, pipes
- Memory mappings — code, heap, stack, shared libraries
- cgroup membership — resource limits

ps aux                                  # All processes
ps -eo pid,ppid,%cpu,%mem,cmd --sort=-%cpu | head
pstree -p                              # Process tree with PIDs

Process States

| State | Symbol | Meaning |
|---|---|---|
| Running | R | Executing on CPU or runnable |
| Sleeping | S | Waiting for event (interruptible) |
| Disk sleep | D | Waiting for I/O (uninterruptible — can't be killed) |
| Zombie | Z | Exited but not yet reaped by parent |
| Stopped | T | Stopped by signal (Ctrl+Z) |

Zombies: A process that has exited but whose parent hasn't called wait(). Zombies consume only a PID table entry but can exhaust the PID space.

# Find zombies (state column starts with Z, e.g. "Z" or "Z+")
ps aux | awk '$8 ~ /^Z/'
# Show each zombie's parent via the PPID column
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'

Orphans: When a parent dies, children are re-parented to PID 1 (systemd), which reaps them.

Signals

| Signal | Number | Default Action | Purpose |
|---|---|---|---|
| SIGHUP | 1 | Terminate | Reload config (by convention) |
| SIGINT | 2 | Terminate | Ctrl+C |
| SIGQUIT | 3 | Core dump | Ctrl+\ |
| SIGKILL | 9 | Terminate | Cannot be caught or ignored |
| SIGTERM | 15 | Terminate | Graceful shutdown (default kill) |
| SIGCHLD | 17 | Ignore | Child process state changed |
| SIGCONT | 18 | Continue | Resume stopped process |
| SIGSTOP | 19 | Stop | Unconditional pause (cannot be caught) |
| SIGTSTP | 20 | Stop | Ctrl+Z (catchable, unlike SIGSTOP) |

Mnemonic: "1 for Hangup, 15 for Terminate, 9 for Kill." Always try SIGTERM before SIGKILL — SIGTERM allows cleanup (flush buffers, close connections). SIGKILL is instant death.

kill PID                # Sends SIGTERM (15) by default
kill -9 PID             # SIGKILL — last resort
kill -HUP PID           # Reload config (nginx, sshd)
kill -0 PID             # Test if process exists (no signal sent)
killall nginx           # Kill all processes named nginx
pkill -f "python app"   # Kill by command pattern
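The SIGTERM-then-cleanup pattern is easy to see with a toy worker. A minimal sketch (hypothetical example, not a real daemon): the worker traps SIGTERM, flushes, and exits 0, whereas SIGKILL would give it no chance to run the trap.

```shell
#!/bin/sh
# Toy worker that cleans up on SIGTERM
worker() {
    trap 'echo "flushing buffers"; exit 0' TERM
    while :; do sleep 1; done    # pretend to do work
}

worker &
pid=$!
sleep 1              # give the trap time to install
kill "$pid"          # plain kill = SIGTERM (15)
wait "$pid"
echo "worker exit status: $?"
# → flushing buffers
# → worker exit status: 0
```

With kill -9 the trap never fires and wait reports a non-zero status instead.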

Part 5: Permissions

The Permission Model

-rwxr-xr-- 1 deploy www-data 4096 Mar 23 14:00 app.py
│└┬┘└┬┘└┬┘   └──┬─┘ └──┬───┘
│ │  │  │       │      │
│ │  │  │     owner  group
│ │  │  └─ other: r-- (read only)
│ │  └──── group: r-x (read + execute)
│ └─────── user:  rwx (read + write + execute)
└───────── type: - (file), d (directory), l (symlink)

For files: r = read contents, w = write contents, x = execute as program

For directories: r = list contents, w = create/delete entries, x = traverse (enter the directory)

Gotcha: A directory without x permission lets you ls the names but not cd into it or access any files inside. This catches everyone at least once.

chmod 755 file          # rwxr-xr-x
chmod 644 file          # rw-r--r--
chmod u+x file          # Add execute for user
chmod -R g+w dir/       # Recursive group write
chown user:group file
chown -R deploy:deploy /opt/app/

Special Bits

| Bit | Octal | On Files | On Directories |
|---|---|---|---|
| SUID | 4000 | Run as file owner | (ignored) |
| SGID | 2000 | Run as file group | New files inherit directory's group |
| Sticky | 1000 | (ignored) | Only file owner can delete (used on /tmp) |

chmod u+s /usr/bin/passwd    # SUID — runs as root
chmod g+s /shared/           # SGID — inherit group
chmod +t /tmp/               # Sticky — only owner can delete
find / -perm -4000 -ls       # Find all SUID files

umask

Controls default permissions for new files:

umask              # Show current mask (e.g., 0022)
# File default:      0666 & ~0022 = 0644 (rw-r--r--)
# Directory default: 0777 & ~0022 = 0755 (rwxr-xr-x)
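Strictly speaking, umask clears bits rather than subtracting (the "subtraction" shorthand only works when every mask bit is set in the default). Shell arithmetic makes this concrete:

```shell
# Effective mode = default & ~umask (a bit-clear, not arithmetic subtraction)
printf '%o\n' $(( 0666 & ~0022 ))   # → 644  (file with umask 0022)
printf '%o\n' $(( 0777 & ~0022 ))   # → 755  (directory with umask 0022)

# Where subtraction would mislead: umask 0033 on a 0666 file
printf '%o\n' $(( 0666 & ~0033 ))   # → 644  (0666 - 0033 would wrongly give 0633)
```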

ACLs and Capabilities

# ACLs: fine-grained permissions beyond user/group/other
getfacl file
setfacl -m u:deploy:rx file

# Capabilities: grant specific root powers without full root
getcap /usr/bin/ping
# → cap_net_raw=ep
setcap cap_net_bind_service=+ep /opt/myapp/server

Part 6: The Filesystem

Everything Is a File

In Linux, almost everything is represented as a file: regular files, directories, devices, sockets, pipes, and even kernel state (/proc, /sys).

The Directory Hierarchy

| Path | Purpose |
|---|---|
| / | Root — everything starts here |
| /bin, /usr/bin | Essential/user binaries |
| /sbin, /usr/sbin | System administration binaries |
| /etc | Configuration files |
| /var | Variable data (logs, databases, mail, caches) |
| /tmp | Temporary files (often cleared on boot) |
| /home | User home directories |
| /root | Root user's home |
| /proc | Virtual filesystem — kernel/process state |
| /sys | Virtual filesystem — hardware/driver state |
| /dev | Device files |
| /boot | Kernel and bootloader |
| /opt | Optional/third-party software |
| /mnt, /media | Mount points |

Filesystem Internals

Inodes: Every file has an inode — a metadata record containing mode, ownership, timestamps, size, and block pointers. The filename is stored in the directory entry, not the inode.

ls -i file                  # Show inode number
stat file                   # Full inode details
df -i                       # Inode usage per filesystem

Gotcha: df -h shows space is available, but writes fail? Check df -i — inodes might be exhausted. This happens with millions of tiny files (session stores, mail queues).

Hard links vs symlinks:

- Hard link: another name pointing to the same inode. Deleting one name doesn't affect the other. Can't cross filesystems.
- Symlink: a pointer to a path. Can break if the target is deleted. Can cross filesystems.

ln file hardlink            # Hard link
ln -s file symlink          # Symbolic link
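A quick sketch in a scratch directory shows both behaviors, using GNU stat to read the inode number (%i) and link count (%h):

```shell
dir=$(mktemp -d) && cd "$dir"
echo "hello" > file
ln file hardlink                 # hard link: second name, same inode
ln -s file symlink               # symlink: pointer to the path "file"

stat -c '%i %h' file hardlink    # same inode number, link count 2 on both

rm file                          # remove one name
cat hardlink                     # → hello   (inode still referenced, data survives)
cat symlink                      # fails: the symlink now dangles
cd / && rm -rf "$dir"
```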

VFS — The Abstraction Layer

The Virtual Filesystem Switch lets Linux use the same syscalls (open, read, write) across all filesystems: ext4, XFS, tmpfs, NFS, overlayfs, procfs. Applications don't need to know which filesystem they're on.

Filesystem Types

| Filesystem | Use Case | Max File Size | Journal | Notes |
|---|---|---|---|---|
| ext4 | General purpose (default on Debian/Ubuntu) | 16 TB | Yes | Mature, well-tested |
| XFS | Large files, high throughput (default on RHEL) | 8 EB | Yes | Excellent at scale |
| Btrfs | Snapshots, checksums, compression | 16 EB | CoW | Modern, more features |
| tmpfs | RAM-backed temporary files | RAM size | No | /tmp, /run |
| overlayfs | Container image layers | Varies | No | Used by Docker |

Caveat: XFS can grow online but cannot be shrunk — ever. ext4 can be shrunk offline. This matters when planning storage.


Part 7: Storage

Block Devices and Partitions

lsblk                       # Block device tree
lsblk -f                    # With filesystem info
fdisk -l                    # Partition tables
blkid                       # UUID and filesystem types

LVM — Logical Volume Manager

LVM adds a virtualization layer between physical disks and filesystems:

Physical Disks → Physical Volumes (PV) → Volume Group (VG) → Logical Volumes (LV) → Filesystems

/dev/sda1 ──→ PV ─┐               ┌──→ LV "app"  (ext4)
                  ├──→ VG "data" ─┤
/dev/sdb1 ──→ PV ─┘               └──→ LV "logs" (xfs)

# Create
pvcreate /dev/sdb1
vgcreate data /dev/sdb1
lvcreate -L 50G -n app data
mkfs.ext4 /dev/data/app

# Extend (online!)
lvextend -L +20G /dev/data/app
resize2fs /dev/data/app         # ext4
xfs_growfs /mountpoint          # XFS

# Status
pvs                              # Physical volumes
vgs                              # Volume groups
lvs                              # Logical volumes

RAID Levels

| Level | Disks | Redundancy | Speed | Use Case |
|---|---|---|---|---|
| RAID 0 | 2+ | None | Fastest | Scratch/temp |
| RAID 1 | 2 | Mirror | Read fast | Boot, small critical |
| RAID 5 | 3+ | 1 disk failure | Good | General purpose |
| RAID 6 | 4+ | 2 disk failures | Good | Large arrays |
| RAID 10 | 4+ | Mirror + stripe | Excellent | Databases, high I/O |

Disk Health

smartctl -a /dev/sda             # SMART data (health, errors, hours)
smartctl -t short /dev/sda       # Run short self-test
iostat -xz 1 5                   # I/O statistics per device
iotop                            # Per-process I/O usage

Mount Operations

mount /dev/sda1 /mnt             # Mount
umount /mnt                      # Unmount
mount -o remount,rw /            # Remount with different options
findmnt                          # Mount tree
cat /etc/fstab                   # Persistent mounts

Mount options that matter for security:

| Option | Purpose |
|---|---|
| noexec | Prevent execution of binaries |
| nosuid | Ignore SUID/SGID bits |
| nodev | Ignore device files |
| ro | Read-only |

LUKS (Linux Unified Key Setup) provides block-device encryption. Commonly used for full-disk encryption, unlocked during initramfs before root mount.


Part 8: Memory Management

The Big Picture

Linux intentionally uses ALL available RAM — unused RAM is wasted RAM. "Free" memory isn't a goal; healthy reclaim is.

free -h
#               total   used   free   shared  buff/cache  available
# Mem:           16G    4.2G   512M    128M      11G        11G
# Swap:          4G     0B     4G

MemAvailable (not MemFree) is what matters — it includes reclaimable cache.
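The authoritative number comes straight from /proc/meminfo, where values are reported in kB. A one-line sketch (Linux-only):

```shell
# MemAvailable is the kernel's estimate of memory usable without swapping
awk '/^MemAvailable:/ { printf "available: %.1f GiB\n", $2 / 1048576 }' /proc/meminfo
```

This is the same field free -h surfaces in its "available" column.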

Memory Types

| Type | Purpose | Reclaimable? |
|---|---|---|
| Anonymous | Process heap, stack, mmap | Only to swap |
| Page cache | File data cached in RAM | Yes (automatically) |
| Slab cache | Kernel data structures (dentries, inodes) | Partially |
| Shared | Shared memory segments, tmpfs | Depends |
| Kernel | Kernel code and data | No |

Virtual Memory

Every process gets its own virtual address space. The kernel maps virtual addresses to physical pages through page tables. This provides:

- Isolation between processes
- Lazy allocation (memory isn't physically allocated until used)
- Copy-on-write after fork()
- Memory-mapped files

cat /proc/PID/maps               # Virtual memory regions
cat /proc/PID/smaps_rollup       # Memory usage summary
pmap PID                         # Process memory map

The OOM Killer

When the system runs out of memory and swap, the kernel's OOM (Out Of Memory) killer selects a process to terminate based on oom_score.

# Check OOM kills
dmesg -T | grep -i "oom\|killed process"
journalctl -k | grep -i oom

# See OOM scores (higher = more likely to be killed)
cat /proc/PID/oom_score

# Protect a process from OOM killer
echo -1000 > /proc/PID/oom_score_adj

# Per-process memory usage
ps aux --sort=-%mem | head -15
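Every process, including the current shell, can read its own score through /proc/self. A minimal sketch (Linux-only; no root needed for reading):

```shell
# Higher score = more likely to be chosen by the OOM killer
cat /proc/self/oom_score

# The admin-tunable adjustment, in the range -1000..1000
# (-1000 exempts the process entirely)
cat /proc/self/oom_score_adj
```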

Swap

Swap is overflow storage for when physical RAM is exhausted. Pages are moved to disk to free RAM for active use.

swapon --show                    # Active swap areas
cat /proc/swaps                  # Same
sysctl vm.swappiness             # How aggressively to swap (0-100)
# 0 = only swap to avoid OOM
# 60 = default
# Lower values prefer dropping page cache

Gotcha: Swap on SSD is fine and much faster than spinning disk. Swap on NVMe is fast enough to be nearly transparent. But ANY swapping means you're under memory pressure — investigate the cause.


Part 9: Networking Fundamentals

IP Configuration

ip addr show                     # IP addresses
ip route show                    # Routing table
ip link show                     # Network interfaces
ip neigh show                    # ARP table

# Legacy commands (still common)
ifconfig                         # IP addresses (deprecated)
route -n                         # Routing table (deprecated)

DNS

DNS resolution path: Application → NSS (Name Service Switch) → resolver (systemd-resolved or direct) → DNS server. To see what the system actually resolves (including /etc/hosts and NSS, not just DNS), use getent hosts rather than dig.

dig example.com +short           # DNS lookup
dig example.com @8.8.8.8         # Query specific server
dig -x 93.184.216.34             # Reverse lookup
host example.com                 # Simple lookup
getent hosts example.com         # What the system resolves to (includes /etc/hosts)
cat /etc/resolv.conf             # DNS configuration
resolvectl status                # systemd-resolved state

TCP/IP Debugging

ss -tlnp                         # Listening TCP ports with process names
ss -s                            # Socket statistics summary
ss -tn state established         # Established connections
ss -tn state time-wait | wc -l   # Count TIME_WAIT

# Connectivity testing
ping host                        # ICMP reachability
traceroute host                  # Path to host
curl -v telnet://host:port       # TCP connectivity test
nc -zv host 80                   # Quick port check
tcpdump -i eth0 port 80          # Packet capture

TCP States You Need to Know

| State | Meaning | Concern |
|---|---|---|
| LISTEN | Waiting for connections | Normal for servers |
| ESTABLISHED | Active connection | Normal |
| TIME_WAIT | Connection closed, waiting to expire | High count = many short connections |
| CLOSE_WAIT | Remote closed, local hasn't closed yet | Bug — application not closing sockets |
| SYN_SENT | Connection attempt in progress | High count = upstream unreachable |

Gotcha: Many CLOSE_WAIT sockets = application bug. The remote side closed the connection but your application hasn't called close(). This causes file descriptor leaks.
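A per-state tally makes leak patterns like this jump out. A sketch run against canned ss -tan output so the counts are reproducible (on a live box, pipe real ss -tan into the same awk):

```shell
# Count connections per TCP state (column 1 of `ss -tan`, header skipped)
ss_output='State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:22          0.0.0.0:*
ESTAB      0      0      10.0.0.5:22         10.0.0.9:50314
CLOSE-WAIT 1      0      10.0.0.5:8080       10.0.0.7:41000
CLOSE-WAIT 1      0      10.0.0.5:8080       10.0.0.7:41002'

printf '%s\n' "$ss_output" |
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' |
    sort -rn
```

A CLOSE-WAIT count that grows over time while ESTAB stays flat is the classic signature of a socket leak.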


Part 10: Firewalls — iptables

The Five Chains

Every packet goes through netfilter hooks where iptables rules are evaluated:

Incoming → PREROUTING → Routing decision → INPUT (for this host)
                                        → FORWARD (passing through)
Outgoing ← POSTROUTING ← OUTPUT (from this host)
| Chain | When | Purpose |
|---|---|---|
| PREROUTING | Before routing | DNAT (change destination) |
| INPUT | Packets for this host | Firewall: allow/deny incoming |
| FORWARD | Packets passing through | Router/Docker/K8s |
| OUTPUT | Packets from this host | Control outgoing |
| POSTROUTING | After routing | SNAT/MASQUERADE (change source) |

Rules evaluate top to bottom. First match wins.

iptables -L -n -v --line-numbers  # List all rules
iptables -t nat -L -n -v          # NAT rules

# Basic firewall
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
iptables -A INPUT -j DROP

# Save/restore
iptables-save > /tmp/rules.bak
iptables-restore < /tmp/rules.bak

Gotcha: Adding a DROP rule before your ACCEPT rules locks you out of SSH immediately on a remote server. Always put ESTABLISHED,RELATED first, then SSH, then DROP.

History: iptables has gone through four generations: ipfwadm (1994) → ipchains (1998) → iptables (2001) → nftables (2014). iptables remains more widely used because Docker, Kubernetes, fail2ban, and UFW all generate iptables rules.

nftables — The Modern Framework

nftables is the official successor to iptables (Linux 3.13+, 2014). It provides a unified syntax for IPv4/IPv6/ARP filtering, replacing the separate iptables/ip6tables/arptables/ebtables commands.

- nft list ruleset shows all rules
- Many distros still use iptables compatibility layers (Docker, K8s, fail2ban emit iptables rules)
- On modern RHEL/Fedora, firewall-cmd wraps nftables

nft list ruleset
firewall-cmd --list-all

Part 11: SSH

What Happens When You Type ssh server

  1. TCP connection to port 22
  2. Key exchange (Diffie-Hellman) — creates shared secret without transmitting it
  3. Server authentication — host key verified against ~/.ssh/known_hosts
  4. User authentication — public key, password, or certificate

Etymology: SSH was created in 1995 by Tatu Ylönen after a password-sniffing attack at Helsinki University. He chose port 22 because it sat between FTP (21) and Telnet (23) — the protocols SSH replaced. He emailed IANA and got the port assigned the same day.

Key Types

| Type | Recommendation | Notes |
|---|---|---|
| Ed25519 | Use this | Fastest, most secure, smallest keys |
| RSA | Legacy, still works | Needs 4096-bit for security |
| ECDSA | Acceptable | Ed25519 is better |
| DSA | Never | Deprecated (broken at 1024-bit) |

ssh-keygen -t ed25519 -C "deploy@company"
ssh-copy-id user@host

SSH Config — The Secret Weapon

# ~/.ssh/config
Host prod-*
    ProxyJump bastion.example.com
    User deploy
    IdentityFile ~/.ssh/deploy_ed25519

Host db-primary
    HostName 10.0.2.50
    Port 2222
    User postgres

SSH Tunneling

# Local forward: access remote service through local port
ssh -L 8080:localhost:80 user@remote
# Now localhost:8080 → remote's localhost:80

# Remote forward: expose local service to remote
ssh -R 8080:localhost:3000 user@remote

# SOCKS proxy: tunnel all traffic
ssh -D 1080 user@remote

# ProxyJump: SSH through bastion
ssh -J bastion.example.com internal-server

Agent Forwarding

eval $(ssh-agent)
ssh-add ~/.ssh/id_ed25519
ssh -A bastion                   # Forward agent to bastion
# Now from bastion, you can SSH to internal hosts using your local key

Security warning: Agent forwarding on untrusted hosts lets root on that host use your key. Prefer ProxyJump instead.


Part 12: The /proc Filesystem

/proc is a virtual filesystem that exposes kernel state as files. Every debugging tool (ps, top, free, lsof) reads from /proc.

Per-Process: /proc/PID/

cat /proc/$$/cmdline | tr '\0' ' '       # Command line
ls -la /proc/$$/exe                       # Binary path
ls -la /proc/$$/cwd                       # Working directory
cat /proc/$$/environ | tr '\0' '\n'       # Environment variables
cat /proc/$$/status                       # State, memory, threads
ls -la /proc/$$/fd/                       # Open file descriptors
cat /proc/$$/maps                         # Memory regions

Gotcha: /proc/PID/environ shows ALL environment variables — including DATABASE_URL and API_KEY. Anyone with access (same user or root) can read your secrets.

System-Wide

cat /proc/meminfo                # Memory details
cat /proc/cpuinfo                # CPU information
cat /proc/loadavg                # Load averages
cat /proc/uptime                 # Uptime in seconds
cat /proc/net/tcp                # TCP connections (hex)
cat /proc/sys/kernel/pid_max     # Max PID value

Practical: Deleted Files Still Using Space

# Find files deleted from disk but still held open by processes
find /proc/*/fd -ls 2>/dev/null | grep deleted
# The space won't be freed until the process closes the file or exits
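The phenomenon is easy to reproduce in a self-contained sketch (Linux-only, using the current shell's own fd table):

```shell
tmp=$(mktemp)
echo "log data" > "$tmp"
exec 3< "$tmp"           # hold the file open on fd 3
rm "$tmp"                # unlink the name; the inode and its blocks survive

ls -l /proc/$$/fd/3      # target shows "(deleted)" after the original path
cat /proc/$$/fd/3        # → log data   (still readable through the fd)

exec 3<&-                # close the fd; only now is the space freed
```

This is why "I deleted the huge log but df didn't change" usually ends with restarting (or signaling) the process that still holds the file open.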

Part 13: Debugging with strace

strace shows every system call a process makes — every file opened, byte read/written, network connection created.

# Trace a running process
strace -p 12345

# Trace with timing
strace -p 12345 -t -T
# -t = timestamp, -T = time spent in each syscall

# Trace specific syscalls
strace -e trace=open,read,write ./myapp
strace -e trace=network ./myapp
strace -e trace=file ./myapp

# Follow child processes
strace -f ./deploy.sh

Pattern: The Stuck Process

strace -p 12345
# → read(5, [hangs here]
# Process is blocked on read from fd 5
ls -la /proc/12345/fd/5
# → socket:[89012] — waiting on a database response

Pattern: The Slow Startup

strace -T -e trace=open,connect ./myapp 2>&1 | sort -t'<' -k2 -rn | head
# → connect(3, {...5432...}) = 0 <5.012>   ← 5 seconds to database!

Pattern: Permission Denied

strace -e trace=open,stat,access ./myapp 2>&1 | grep EACCES
# → openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR) = -1 EACCES

Part 14: Performance Triage

The USE Method

For each resource, check Utilization, Saturation, Errors:

| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | uptime (load avg), mpstat | Run queue length | dmesg |
| Memory | free -h, vmstat | Swap activity, OOM | dmesg \| grep oom |
| Disk | iostat -xz, df -h | await in iostat | dmesg \| grep error |
| Network | sar -n DEV, ss -s | Overflows, drops | nstat, ip -s link |

Quick Triage Sequence

uptime                           # Load averages
dmesg -T | tail -20              # Recent kernel messages
free -h                          # Memory
df -h                            # Disk space
df -i                            # Inodes
iostat -xz 1 3                   # Disk I/O
ss -s                            # Socket summary
ps aux --sort=-%cpu | head -10   # Top CPU consumers
ps aux --sort=-%mem | head -10   # Top memory consumers

Load Average Decoded

uptime
# → load average: 4.50, 3.20, 2.10
#                 1min  5min  15min

Load average = number of runnable + uninterruptibly sleeping processes. Compare to CPU count (nproc):

- Load < CPU count: system has headroom
- Load = CPU count: fully utilized
- Load > 2× CPU count: significant saturation

Gotcha: High load with low CPU% often means I/O wait — processes blocked on disk. Check iostat -xz 1.
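The comparison above can be scripted. A minimal sketch, assuming Linux paths (/proc/loadavg) and nproc from coreutils:

```shell
# Read the 1-minute load and classify it against the CPU count
load=$(cut -d' ' -f1 /proc/loadavg)
cpus=$(nproc)
awk -v l="$load" -v c="$cpus" 'BEGIN {
    if      (l < c)     print "headroom  (load " l " < " c " CPUs)"
    else if (l < 2 * c) print "busy      (load " l " vs " c " CPUs)"
    else                print "saturated (load " l " >= 2x " c " CPUs)"
}'
```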


Part 15: Logging

Log Locations

/var/log/syslog (or /var/log/messages)  — System log
/var/log/auth.log                        — Authentication events
/var/log/kern.log                        — Kernel messages
/var/log/nginx/access.log               — Web server access
/var/log/nginx/error.log                — Web server errors

journalctl Essentials

journalctl -u nginx -f                  # Follow service logs
journalctl -u nginx --since "1 hour ago"
journalctl -p err -b                    # Errors since boot
journalctl -xe                          # Recent errors with context
journalctl -k                           # Kernel messages
journalctl --vacuum-size=500M           # Trim logs

logrotate

# /etc/logrotate.d/myapp
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        systemctl reload myapp
    endscript
}
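To check a config like this without waiting for the daily run, logrotate has a debug mode (-d parses the file and prints what it would do, without rotating anything). The /tmp paths below are just for the demo:

```shell
# Write a throwaway config and dry-run it (-d = debug, -s = private state file)
cat > /tmp/demo.rotate <<'EOF'
/tmp/demo.log {
    rotate 2
    missingok
}
EOF
touch /tmp/demo.log
logrotate -d -s /tmp/demo.state /tmp/demo.rotate
```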

Part 16: Package Management

Debian/Ubuntu (apt/dpkg)

apt update                       # Refresh package index
apt upgrade                      # Upgrade all packages
apt install nginx                # Install
apt remove nginx                 # Remove (keep config)
apt purge nginx                  # Remove with config
apt search keyword               # Search
dpkg -l | grep nginx             # Check installed
dpkg -L nginx                    # List files from package
apt-cache policy nginx           # Version/repo info

RHEL/Fedora/CentOS (dnf/rpm)

dnf install nginx
dnf remove nginx
dnf search keyword
dnf list installed | grep nginx
rpm -qa | grep nginx             # Check installed
rpm -ql nginx                    # List files
dnf info nginx                   # Package details

Part 17: Text Processing

The Pipeline Philosophy

# Chain tools with pipes
cat access.log | grep "500" | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
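Here is the same shape of pipeline run against a tiny synthetic log, so the output is predictable (the file path and log lines are made up for illustration):

```shell
# Fake access log: field 1 = client IP, status code follows the quoted request
printf '%s\n' \
  '10.0.0.1 - - "GET /a" 500 123' \
  '10.0.0.1 - - "GET /b" 500 456' \
  '10.0.0.2 - - "GET /c" 500 789' \
  '10.0.0.2 - - "GET /d" 200 111' > /tmp/demo_access.log

# Filter 500s, count by client IP, sort by count descending
grep '" 500 ' /tmp/demo_access.log | awk '{print $1}' | sort | uniq -c | sort -rn
# → 2 10.0.0.1 on the first line, 1 10.0.0.2 on the second
```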

grep — Find Lines

grep "ERROR" /var/log/syslog
grep -i "error" file             # Case-insensitive
grep -r "TODO" /src/             # Recursive
grep -c "500" access.log         # Count matches
grep -v "DEBUG" file             # Invert (exclude)
grep -E "error|warning" file     # Extended regex (OR)
grep -A 3 "FATAL" file           # 3 lines after match
grep -B 2 "FATAL" file           # 2 lines before

awk — Field Processing

awk '{print $1}' access.log                    # First field
awk -F: '{print $1, $7}' /etc/passwd           # Custom delimiter
awk '$9 >= 500 {print $1, $7, $9}' access.log  # Filter by field value
awk '{sum+=$10} END {print sum}' access.log     # Sum a column

sed — Stream Editing

sed 's/old/new/g' file           # Replace all occurrences
sed -i 's/old/new/g' file        # In-place edit
sed -n '10,20p' file             # Print lines 10-20
sed '/pattern/d' file            # Delete matching lines

Other Essential Tools

sort file                        # Sort lines
sort -rn file                    # Reverse numeric sort
uniq -c                          # Count duplicates (requires sorted input)
wc -l file                       # Count lines
cut -d: -f1 /etc/passwd          # Extract fields
tr 'a-z' 'A-Z'                  # Translate characters
head -20 file                    # First 20 lines
tail -f file                     # Follow file growth
tee file                         # Write to file AND stdout
xargs                            # Build commands from stdin

Part 18: cgroups and Namespaces

cgroups — Resource Control

cgroups limit, account for, and isolate resource usage (CPU, memory, I/O) of process groups. This is the foundation of container resource limits.

Note: Modern distributions use cgroup v2 (the unified hierarchy) by default; cgroup v1 used a separate hierarchy per controller. The interface file paths differ between versions, so both are shown below. systemd on modern kernels uses v2 exclusively.

# See cgroup hierarchy
systemd-cgtop                    # Live cgroup resource usage
cat /sys/fs/cgroup/system.slice/docker-CONTAINER_ID.scope/memory.current   # v2
cat /sys/fs/cgroup/memory/docker/CONTAINER_ID/memory.usage_in_bytes        # v1
cat /proc/PID/cgroup             # Which cgroup a process belongs to

# systemd sets cgroups via unit file directives:
# MemoryMax=512M
# CPUQuota=200%
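A process can also find its own spot in the hierarchy directly. A minimal sketch assuming cgroup v2 paths (it falls back to a message where the v2 memory controller file doesn't exist):

```shell
# /proc/self/cgroup on v2 is a single line: 0::/some/path
cg=$(awk -F'::' '/^0::/ {print $2}' /proc/self/cgroup)
echo "my cgroup: ${cg:-<none>}"
# memory.max is "max" (unlimited) or a byte count
cat "/sys/fs/cgroup${cg}/memory.max" 2>/dev/null || echo "no v2 memory.max here"
```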

Namespaces — Isolation

Namespaces provide isolated views of system resources:

| Namespace | Isolates | Container Use |
|---|---|---|
| PID | Process IDs | Container sees own PID 1 |
| Network | Network stack | Container gets own IP |
| Mount | Filesystem mounts | Container sees own root |
| UTS | Hostname | Container has own hostname |
| User | UID/GID mappings | Rootless containers |
| IPC | IPC objects | Isolated shared memory |

# Create a network namespace
ip netns add test
ip netns exec test ip addr show
# → only loopback exists in this namespace

# See namespaces of a process
ls -la /proc/PID/ns/

cgroups + namespaces = containers. Docker, Kubernetes, and LXC all use these kernel features.
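You can watch namespace isolation happen without any container runtime: unshare (from util-linux) creates namespaces directly. The rootless form below assumes unprivileged user namespaces are enabled on your kernel:

```shell
# -r maps you to root inside a new user namespace; --fork --pid --mount-proc
# gives the child its own PID namespace with a matching /proc mount
unshare -r --fork --pid --mount-proc sh -c 'echo "I am PID $$"; ps -e | wc -l'
# → the shell sees itself as PID 1 and only a handful of processes
```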


Part 19: Security Hardening

SSH Hardening

# /etc/ssh/sshd_config
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy admin
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2

Firewall Basics

# UFW (Uncomplicated Firewall — Ubuntu)
ufw enable
ufw default deny incoming
ufw allow ssh
ufw allow 80/tcp
ufw allow 443/tcp
ufw status verbose

SELinux (RHEL/CentOS)

getenforce                       # Current mode: Enforcing/Permissive/Disabled
setenforce 0                     # Set permissive (temporary)
ausearch -m avc -ts recent       # Recent denials
sealert -a /var/log/audit/audit.log  # Human-readable alerts
restorecon -Rv /var/www/         # Fix file contexts

Kernel Hardening (sysctl)

# /etc/sysctl.d/99-hardening.conf
net.ipv4.conf.all.rp_filter = 1          # Reverse path filtering
net.ipv4.conf.all.accept_redirects = 0   # Ignore ICMP redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.tcp_syncookies = 1              # SYN flood protection
kernel.dmesg_restrict = 1                # Restrict dmesg access
fs.protected_hardlinks = 1               # Prevent hardlink attacks
fs.protected_symlinks = 1                # Prevent symlink attacks
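Settings in /etc/sysctl.d/ apply at boot; to load and verify them immediately (reading a key needs no privileges, writing does):

```shell
sysctl --system                       # reload every /etc/sysctl.d/*.conf (root)
sysctl net.ipv4.tcp_syncookies        # read back one key
sysctl -w kernel.dmesg_restrict=1     # set a key immediately (root)
```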

Audit

auditctl -w /etc/passwd -p wa -k passwd_changes  # Watch file changes
ausearch -k passwd_changes                         # Search audit log

Part 20: eBPF — The Linux Superpower

eBPF lets you run sandboxed programs inside the Linux kernel without changing kernel code or loading kernel modules. It's used for networking (Cilium), security (Falco, Tetragon), and observability (bpftrace, bcc).

# bpftrace one-liners
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
# → Trace every file open across the entire system

bpftrace -e 'tracepoint:syscalls:sys_enter_connect { printf("%s connecting\n", comm); }'
# → Trace every network connection

# bcc tools (pre-built eBPF tools)
execsnoop                        # Trace new processes
opensnoop                        # Trace file opens
biolatency                       # Block I/O latency histogram
tcpconnect                       # Trace outbound TCP connections

History: BPF (Berkeley Packet Filter) was created in 1992 for packet filtering. In 2014, Alexei Starovoitov extended it into eBPF — a general-purpose in-kernel virtual machine. The kernel verifier ensures eBPF programs can't crash the kernel, loop forever, or access invalid memory.


Part 21: Linux Distributions

| Distro Family | Examples | Package Manager | Default FS | Use Case |
|---|---|---|---|---|
| Debian | Debian, Ubuntu, Mint | apt/dpkg | ext4 | Servers, desktops, cloud |
| Red Hat | RHEL, CentOS, Fedora, Rocky, Alma | dnf/rpm | XFS | Enterprise, compliance |
| Arch | Arch, Manjaro | pacman | ext4 | Rolling release, DIY |
| Alpine | Alpine | apk | ext4 | Containers (tiny, musl) |
| SUSE | openSUSE, SLES | zypper/rpm | Btrfs | Enterprise, snapshots |

Key difference: Debian-family uses /etc/apt/, .deb packages, systemctl. Red Hat-family uses /etc/yum.repos.d/, .rpm packages, systemctl. The core Linux is the same — differences are in packaging, default configs, and support models.


Part 22: On-Call Survival Guide

Disk Full

df -h                                          # Which filesystem is full?
du -sh /var/log/* | sort -rh | head -10        # Biggest log directories
journalctl --vacuum-size=500M                  # Trim journal
find /var -xdev -type f -size +100M            # Large files
lsof +L1                                       # Deleted but still open files

OOM Killer

dmesg -T | grep -i "oom\|killed process"       # What was killed?
free -h                                         # Current memory state
ps aux --sort=-%mem | head -15                  # Memory hogs

Service Failed

systemctl status SERVICE                        # State + recent logs
journalctl -u SERVICE -n 50 --no-pager          # Full error output
ss -tlnp | grep PORT                            # Port conflict?
systemctl restart SERVICE                       # Try restart

High Load

uptime && nproc                                 # Load vs CPU count
top -bn1 | head -25                             # Process overview
iostat -xz 1 3                                  # I/O wait?
iotop -a -b -n 3 | head -20                     # Which process?

Safe vs Dangerous Actions

| Safe (do without asking) | Dangerous (get approval) |
|---|---|
| Read df, top, ps, dmesg, free | kill -9 any process |
| journalctl (read logs) | Restart critical services |
| lsof, ss (read sockets) | Delete files to free space |
| Journal vacuum | docker/crictl prune |
| systemctl status | Reboot the host |

Part 23: Real-World Case Studies

Case 1: OOM Killer Takes Down the App

Symptom: Java application crashes every few hours. No application error logs.

Investigation: dmesg | grep oom reveals the kernel OOM killer terminating the Java process. The JVM heap was configured at 4GB on a 4GB host — leaving no room for the kernel, page cache, or other processes.

Fix: Set JVM heap to 75% of available memory. Add MemoryMax= to the systemd unit. Monitor with /proc/PID/status VmRSS.

Case 2: Disk "Full" but df Shows Space

Symptom: Application can't create new files. df -h shows 60% used.

Investigation: df -i shows 100% inode usage. The mail spool (/var/spool/mail/) contained 2 million tiny files — one per unread notification. Each file used one inode even though it was only a few bytes.

Fix: Clean up mail spool. Move to a filesystem with more inodes or use mkfs -N to specify inode count.

Case 3: Zombie Processes Filling PID Space

Symptom: fork() fails with EAGAIN. New processes can't start.

Investigation: ps aux | grep Z | wc -l shows 15,000 zombie processes. A poorly written monitoring script spawned child processes but never called wait(). Dead children accumulated as zombies until PID space was exhausted.

Fix: Fix the parent process to reap children. Kill the parent (orphaned zombies are adopted and reaped by PID 1). Increase kernel.pid_max as temporary relief.
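The failure mode reproduces in a couple of lines of shell. A sketch (the sleep durations are arbitrary):

```shell
# Parent execs into sleep and never wait()s, so its short-lived child
# lingers in state Z until the parent itself exits
bash -c 'sleep 0.2 & exec sleep 3' &
parent=$!
sleep 1
ps -o pid,ppid,stat,comm --ppid "$parent"
# → the defunct sleep shows STAT "Z"
```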

Case 4: systemd Service Flapping

Symptom: Service starts, runs for 2 seconds, crashes, restarts, repeats. Eventually hits start-limit-hit.

Investigation: journalctl -u myapp shows the app exits immediately with "config file not found." The config file exists at /etc/myapp/config.yaml, but systemd runs the service with WorkingDirectory=/opt/myapp, and the app uses a relative path ./config.yaml.

Fix: Use absolute paths in the app config, or set WorkingDirectory= correctly. Reset the start limit: systemctl reset-failed myapp.

Case 5: Runaway Logs Fill Root Disk

Symptom: System becomes unresponsive. SSH login is slow, commands fail with "No space left on device."

Investigation: df -h shows / at 100%. du -sh /var/log/* | sort -rh reveals a 45GB access.log. Logrotate was configured but the cron job wasn't running (cron service was disabled during a security audit and never re-enabled).

Fix: truncate -s 0 /var/log/nginx/access.log (safer than rm — avoids the deleted-but-open-file problem). Re-enable cron. Separate /var on its own partition to prevent log growth from bricking the root filesystem.
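The deleted-but-open problem mentioned above is easy to demonstrate in the current shell (the /tmp filename is arbitrary):

```shell
exec 3> /tmp/held.log            # open fd 3 for writing
echo "some data" >&3
rm /tmp/held.log                 # unlink: directory entry gone, inode still open
ls -l /proc/$$/fd/3              # → shown as "/tmp/held.log (deleted)"
exec 3>&-                        # close the fd; only now are the blocks freed
```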

Case 6: Kernel Soft Lockup

Symptom: dmesg shows BUG: soft lockup - CPU#3 stuck for 22s! Intermittent system freezes.

Investigation: A kernel module (buggy storage driver) holds a spinlock for too long, preventing the soft lockup watchdog from running. The kernel reports it but doesn't crash (soft lockup, not hard lockup).

Fix: Update or replace the problematic driver. As temporary mitigation, increase kernel.softlockup_panic threshold or disable the affected hardware.


Glossary

| Term | Definition |
|---|---|
| Kernel | Core of the OS — controls CPU, memory, devices, provides syscalls |
| Syscall | Interface between userspace and kernel (open, read, write, fork, exec) |
| PID | Unique process identifier |
| PID 1 | Init process (systemd) — parent of all processes, kernel panics if it exits |
| Inode | File metadata record (permissions, timestamps, block pointers) — not the filename |
| File descriptor (FD) | Number referencing an open file/socket/pipe (0=stdin, 1=stdout, 2=stderr) |
| VFS | Virtual Filesystem Switch — unified interface across all filesystem types |
| Page cache | RAM used to cache file data — automatically reclaimed under pressure |
| cgroup | Control group — limits CPU, memory, I/O for a group of processes |
| Namespace | Isolation boundary (PID, network, mount, user) — foundation of containers |
| Unit | systemd-managed object: service, socket, timer, mount, target |
| Target | systemd equivalent of runlevel — group of units (multi-user.target, graphical.target) |
| Initramfs | Temporary RAM filesystem for finding and mounting the real root |
| GRUB | GRand Unified Bootloader — loads the kernel from disk |
| ESP | EFI System Partition — FAT32 partition with bootloader binaries |
| MBR | Master Boot Record — 512-byte boot sector (legacy, max 2TB) |
| GPT | GUID Partition Table — modern, supports 128 partitions and huge disks |
| LVM | Logical Volume Manager — virtualization layer between disks and filesystems |
| RAID | Redundant Array of Independent Disks — mirroring/striping for reliability/speed |
| Swap | Disk area used as overflow when RAM is exhausted |
| OOM killer | Kernel mechanism that kills processes when memory is completely exhausted |
| SIGTERM | Signal 15 — graceful shutdown request (catchable) |
| SIGKILL | Signal 9 — immediate termination (cannot be caught) |
| Zombie | Process that has exited but parent hasn't called wait() — consumes only PID entry |
| Orphan | Process whose parent died — adopted by PID 1 |
| iowait | CPU time spent waiting for I/O completion — suggests storage bottleneck |
| Load average | Number of runnable + uninterruptibly sleeping processes |
| umask | Default permission mask for newly created files |
| SUID | Set User ID — executable runs as file owner (e.g., passwd runs as root) |
| SELinux | Security-Enhanced Linux — mandatory access control system |
| AppArmor | Application Armor — path-based mandatory access control |
| eBPF | Extended Berkeley Packet Filter — sandboxed kernel programs for observability |
| strace | Traces system calls between a process and the kernel |
| iptables | Packet filtering framework using netfilter hooks |
| SSH | Secure Shell — encrypted protocol for remote access (port 22) |
| TOFU | Trust On First Use — SSH's security model for host verification |
| TTY | Terminal device — from TeleTYpewriter, now virtual terminal |

Trivia and History

  1. BIOS survived 40 years. Created by Gary Kildall for CP/M in 1975, adopted by IBM in 1981. UEFI didn't fully replace it until around 2020.

  2. GRUB is a small operating system. It has filesystem drivers, a shell, a scripting language, and a network stack. The name is a physics reference — "Grand Unified" like a Grand Unified Theory.

  3. The kernel decompresses itself. vmlinuz (the "z" = compressed) contains a decompression stub that unpacks the real kernel. The compressed image is ~12MB; uncompressed is 30-50MB.

  4. PID 1 is unkillable. The kernel drops unhandled signals sent to PID 1. If PID 1 exits, the kernel panics. In containers, this causes problems — docker stop sends SIGTERM, but if your app is PID 1 and doesn't handle it, Docker waits 10 seconds then SIGKILLs.

  5. systemd is the most controversial Linux project. Announced by Lennart Poettering in 2010, it replaced SysV init (1983). The Debian vote in 2014 nearly split the project. Devuan was forked specifically to maintain systemd-free Debian.

  6. The rc in rc.d stands for "run commands." From AT&T System V Unix (1983). The S01/S02 numbering convention used shell globbing for sequencing — it worked for 30 years.

  7. systemd can boot in under 2 seconds. Parallel service startup, socket activation, and aggressive dependency management on SSD hardware. systemd-analyze shows exactly where time is spent.

  8. SSH was born from a password-sniffing attack. Tatu Ylönen wrote it in 1995 after thousands of plaintext Telnet/FTP passwords were captured at Helsinki University. Port 22 was unassigned and sat between FTP (21) and Telnet (23).

  9. Ed25519 was designed by the SYN cookies inventor. Daniel J. Bernstein designed it to resist timing side-channel attacks. Public key is 68 characters vs 372+ for RSA.

  10. The 0x55AA boot signature is from 1981. Every x86 machine still checks for these two bytes at offset 510-511 to decide if a disk is bootable.

  11. iptables has four generations. ipfwadm (1994) → ipchains (1998) → iptables (2001) → nftables (2014). iptables remains dominant because Docker, Kubernetes, fail2ban, and UFW all generate iptables rules.

  12. eBPF started as a packet filter in 1992. Berkeley Packet Filter was extended in 2014 by Alexei Starovoitov into a general-purpose in-kernel virtual machine. Now used for networking (Cilium), security (Falco), and observability (bpftrace).

  13. Linux uses ALL your RAM on purpose. Unused RAM is wasted RAM. Linux fills it with page cache (file data). MemAvailable in free -h is what actually matters, not MemFree.

  14. /proc was just for processes originally. The name means "process." Linux stuffed more and more kernel state into it over time. Plan 9 (Bell Labs, 1992) took the concept further — everything is a file, including the CPU and network stack.

  15. Inodes run out before disk space. A filesystem with 0% disk used but 100% inodes used refuses to create new files. This happens with millions of tiny files (session stores, mail queues).


Flashcard Review

Boot and Kernel

| Q | A |
|---|---|
| What are the boot stages in order? | Firmware (BIOS/UEFI) → GRUB → Kernel → Initramfs → PID 1 (systemd) |
| What is initramfs for? | Temporary RAM filesystem with drivers/tools to find and mount the real root |
| What happens if PID 1 exits? | Kernel panic — system halts |
| What is the kernel command line? | Parameters passed to the kernel by GRUB (cat /proc/cmdline) |
| How do you boot to rescue mode? | Edit GRUB entry, add systemd.unit=rescue.target or single |

Processes and Signals

| Q | A |
|---|---|
| SIGTERM vs SIGKILL? | SIGTERM (15) = graceful, catchable. SIGKILL (9) = immediate, uncatchable |
| What is a zombie process? | Exited process whose parent hasn't called wait() — consumes only a PID entry |
| What does kill -0 PID do? | Tests if process exists without sending a signal |
| What does load average measure? | Number of runnable + uninterruptibly sleeping processes |

Filesystems and Storage

| Q | A |
|---|---|
| What is an inode? | File metadata record (permissions, timestamps, block pointers). Filename is in the directory entry. |
| df shows space but writes fail — why? | Inode exhaustion (df -i), read-only remount, or quota |
| Hard link vs symlink? | Hard link = same inode, can't cross filesystems. Symlink = path pointer, can break |
| What does LVM add? | Virtualization layer: resize, snapshot, span disks without filesystem changes |

Memory

| Q | A |
|---|---|
| What matters more: MemFree or MemAvailable? | MemAvailable — includes reclaimable cache |
| What is the OOM killer? | Kernel kills processes when memory + swap are exhausted |
| What does swap do? | Moves inactive memory pages to disk to free RAM |

Networking and Security

| Q | A |
|---|---|
| Many CLOSE_WAIT sockets — what's wrong? | Application bug — not closing connections after remote closes |
| iptables rule order — what matters? | First match wins. Put ACCEPT before DROP |
| SSH key type to use? | Ed25519 — fastest, most secure, smallest keys |
| yaml.safe_load() vs yaml.load()? | safe_load prevents code execution — always use it |
| What does noexec mount option do? | Prevents execution of binaries on that filesystem |

Debugging

| Q | A |
|---|---|
| Process stuck — first tool? | strace -p PID to see what syscall it's blocked on |
| High load, low CPU — what to check? | I/O wait: iostat -xz 1 |
| Quick triage sequence? | uptime → dmesg → free → df → iostat → ss → top |
| What does /proc/PID/fd/ show? | All open file descriptors (files, sockets, pipes) |

Drills

Drill 1: /proc Exploration (Easy)

Q: Find the command line, working directory, and environment variables of your current shell.

Answer
cat /proc/$$/cmdline | tr '\0' ' '
ls -la /proc/$$/cwd
cat /proc/$$/environ | tr '\0' '\n' | head
cat /proc/$$/status | grep -E "Name|State|Pid|VmRSS"

Drill 2: Open File Descriptors (Easy)

Q: Find all open file descriptors for a process. Find deleted-but-still-open files system-wide.

Answer
ls -la /proc/PID/fd/
ls /proc/PID/fd/ | wc -l
# Deleted files still holding disk space:
find /proc/*/fd -ls 2>/dev/null | grep deleted

Drill 3: Socket States (Easy)

Q: List all listening TCP ports with process names. Count TIME_WAIT connections.

Answer
ss -tlnp
ss -tn state time-wait | wc -l
ss -tn state close-wait          # Bug indicator
ss -s                            # Summary

Drill 4: Find Disk Hogs (Easy)

Q: Find what's consuming disk space. Check inode usage.

Answer
df -h
du -sh /* 2>/dev/null | sort -rh | head -10
du -sh /var/log/* | sort -rh | head -10
df -i                            # Inode usage
find /var -xdev -type f -size +100M   # Large files

Drill 5: systemd Override (Medium)

Q: Add memory limits to nginx without editing the original unit file.

Answer
systemctl edit nginx
# Add:
# [Service]
# MemoryMax=1G
# LimitNOFILE=65536

# Or manually:
mkdir -p /etc/systemd/system/nginx.service.d/
echo -e "[Service]\nMemoryMax=1G" > /etc/systemd/system/nginx.service.d/override.conf
systemctl daemon-reload
systemctl restart nginx
systemctl show nginx -p MemoryMax

Drill 6: journalctl Filtering (Medium)

Q: Find all errors from nginx in the last hour. Show kernel OOM messages since boot.

Answer
journalctl -u nginx -p err --since "1 hour ago"
journalctl -k | grep -i oom
journalctl -u nginx -o json-pretty -n 1

Drill 7: strace a Stuck Process (Medium)

Q: A process is using 0% CPU but is listed as running. Diagnose what it's waiting for.

Answer
strace -p PID
# Shows the syscall it's blocked on (e.g., read, connect, futex)
# Check what the file descriptor points to:
ls -la /proc/PID/fd/N
# If it's a socket, find the remote:
ss -tnp | grep PID

Drill 8: Process Tree and Zombies (Medium)

Q: Display the process tree. Find zombie processes and their parents.

Answer
pstree -p
ps aux | awk '$8 ~ /^Z/ {print $0}'
# Find parent of zombies:
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/ {print "Zombie PID:", $1, "Parent:", $2}'

Drill 9: cgroup Inspection (Medium)

Q: Check what cgroup a Docker container belongs to and its resource limits.

Answer
# Find container's cgroup
docker inspect CONTAINER --format '{{.HostConfig.Memory}}'
cat /proc/$(docker inspect CONTAINER --format '{{.State.Pid}}')/cgroup
systemd-cgtop                    # Live resource usage

Drill 10: Performance Triage (Hard)

Q: A server has load average 25 but only 4 CPUs. Diagnose whether it's CPU-bound or I/O-bound.

Answer
uptime && nproc                  # Load 25 vs 4 CPUs = 6x overloaded
iostat -xz 1 3                  # Check %iowait and await
# High iowait → disk bottleneck:
iotop -a -b -n 3 | head -20     # Which process?
# Low iowait → CPU contention:
ps aux --sort=-%cpu | head -10   # Who's consuming CPU?
mpstat -P ALL 1 3                # Per-CPU breakdown

Cheat Sheet

Process Management

ps aux                          # All processes
pgrep -f pattern                # Find by name
kill -15 PID                    # Graceful (SIGTERM)
kill -9 PID                     # Force (SIGKILL)
kill -0 PID                     # Check alive

systemd

systemctl status/start/stop/restart SERVICE
systemctl enable/disable SERVICE
systemctl list-units --failed
systemctl daemon-reload
journalctl -u SERVICE -f
journalctl -p err -b

Disk & Memory

df -h / df -i                   # Space / inodes
du -sh /* | sort -rh | head     # Biggest dirs
free -h                         # Memory
dmesg -T | grep oom             # OOM kills

Network

ss -tlnp                        # Listening ports
ip addr show                    # IP addresses
ip route show                   # Routes
dig domain +short               # DNS

Performance

uptime                          # Load average
iostat -xz 1 3                  # Disk I/O
vmstat 1 5                      # Memory/CPU/IO
top -bn1 | head -20             # Process overview

Permissions

chmod 755 file                  # rwxr-xr-x
chown user:group file
find / -perm -4000 -ls          # SUID files

Quick Triage Chain

systemctl status → journalctl -u → ss -tlnp → df -h → free -h → top → iostat

Self-Assessment

Boot and Kernel

  • I can explain the 5 boot stages (firmware → GRUB → kernel → initramfs → systemd)
  • I know what initramfs does and when to rebuild it
  • I understand PID 1's special role
  • I can use dmesg and systemd-analyze to debug boot issues

Processes and Services

  • I understand process states (R, S, D, Z, T)
  • I know the difference between SIGTERM and SIGKILL
  • I can write and manage systemd unit files
  • I can use drop-in overrides and timers
  • I can diagnose zombie processes

Filesystems and Storage

  • I understand inodes, hard links, and symlinks
  • I can diagnose disk full vs inode exhaustion
  • I can use LVM to create and extend volumes
  • I know the differences between ext4, XFS, and Btrfs

Memory and Performance

  • I understand MemAvailable vs MemFree
  • I know what the OOM killer does and how to investigate it
  • I can use the USE method for performance triage
  • I can distinguish I/O-bound from CPU-bound load

Networking and Security

  • I can read iptables rules and understand chain order
  • I can diagnose TCP state issues (CLOSE_WAIT, TIME_WAIT)
  • I can set up SSH key authentication and tunnels
  • I understand file permissions, SUID, and umask
  • I know basic hardening steps (SSH, firewall, sysctl)

Debugging

  • I can use strace to diagnose stuck/slow processes
  • I can navigate /proc to inspect process state
  • I can use journalctl to find service errors
  • I can perform a quick performance triage in 60 seconds