
Linux - Foundations and Operations Guide

Scope: Modern Linux from boot to production operations - updated for systemd-era hosts, cgroup v2, nftables-era firewalling, and current distro realities.

Topics: Boot process, kernel, systemd, processes and signals, permissions, filesystems and storage, LVM, RAID, LUKS, memory, networking, DNS and NSS, nftables and iptables, SSH, /proc, strace, performance triage, logging, packages, text processing, cgroups and namespaces, hardening, eBPF, distro differences, on-call triage, drills, cheat sheet.

Level: L0-L2 (zero -> foundations -> operations)

What this guide is and is not:
  • This is a practical Linux foundations and operations guide.
  • It favors accurate mental models and field-useful commands over trivia and vendor marketing.
  • It is broad, but it is not magic. Some areas still deserve dedicated deep dives: storage recovery, advanced networking, SELinux policy authoring, kernel internals, and performance analysis at scale.


The Mission

A rack server powers on in a datacenter. In under a minute it goes from dead silicon to firmware, bootloader, kernel, initramfs, PID 1, services, sockets, filesystems, and a login prompt. Later you SSH in, restart a service, inspect logs, and fix a production issue. Linux is the stack connecting all of that.

The goal here is not to turn you into a command parrot. The goal is to make the machine legible.


Table of Contents

  1. The Boot Sequence
  2. The Kernel
  3. systemd
  4. Processes and Signals
  5. Users, Permissions, ACLs, and Capabilities
  6. The Filesystem
  7. Storage - Partitions, LVM, RAID, LUKS
  8. Memory Management
  9. Networking Fundamentals
  10. Firewalls - nftables First, iptables Legacy
  11. SSH
  12. The /proc Filesystem
  13. Debugging with strace
  14. Performance Triage
  15. Logging
  16. Package Management
  17. Text Processing
  18. cgroups and Namespaces
  19. Security Hardening
  20. eBPF
  21. Linux Distributions
  22. On-Call Survival Guide
  23. Real-World Case Studies
  24. Glossary
  25. Flashcards
  26. Drills
  27. Cheat Sheet
  28. Self-Assessment

Part 1: The Boot Sequence

You press the power button. Here is the practical version of what happens.

Stage 1: Firmware - BIOS or UEFI

The power supply stabilizes and emits a Power Good signal. The CPU starts executing from a fixed reset vector. At that instant there is no mounted disk, no userspace, no shell, and no kernel scheduler yet.

Legacy BIOS path:

Power on -> POST -> read first sector / boot code -> jump to bootloader

Modern UEFI path:

Power on -> POST -> read NVRAM boot entries -> load EFI executable from ESP -> bootloader runs

Feature | BIOS | UEFI
Partition table | MBR | GPT
Practical disk limit | ~2 TB with classic MBR | effectively enormous
Boot environment | 16-bit constraints | 32/64-bit firmware environment
Secure Boot | No | Yes
Bootloader location | MBR + post-MBR tricks | EFI binary on the ESP

Secure Boot, in the real world:
  • Firmware validates shim against keys in firmware.
  • shim validates GRUB or MokManager.
  • GRUB validates and loads the signed kernel.
  • The kernel enforces signature rules for loadable modules.
  • Initrd/initramfs images are commonly not part of that same validation chain, so do not imagine Secure Boot as a perfectly sealed steel coffin.

Stage 2: Bootloader - usually GRUB

GRUB is a tiny operating system whose job is to locate the kernel, hand it a command line, and usually provide a boot menu.

cat /proc/cmdline
systemd-analyze

Useful kernel parameters:

Parameter | Purpose
root=UUID=... | real root filesystem
ro | mount root read-only first
systemd.unit=rescue.target | rescue target
single or 1 | traditional single-user shorthand
rd.break | break into initramfs shell
init=/bin/bash | bypass normal init entirely
console=ttyS0,115200 | serial console
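
To confirm which of these parameters the running kernel actually received, the command line can be split into tokens and searched. A minimal sketch, assuming no quoted values that contain spaces; cmdline_param is an illustrative helper name, not a standard tool.

```shell
# Print the value of one kernel command-line parameter.
# Reads a /proc/cmdline-style line on stdin; exits non-zero if the key is absent.
cmdline_param() {
    tr ' ' '\n' | awk -F= -v key="$1" '
        $1 == key {
            # keep everything after the first "=", so root=UUID=... survives intact
            print substr($0, length(key) + 2)
            found = 1
            exit
        }
        END { exit !found }'
}

# Usage on a live host:
#   cmdline_param root < /proc/cmdline
```

A bare flag such as ro prints an empty line and still exits zero, so the exit status alone tells you whether the parameter was present.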

Do not edit generated GRUB config directly.
  • Debian/Ubuntu habit: edit /etc/default/grub and files in /etc/grub.d/, then run update-grub.
  • RHEL-family habit: use grub2-mkconfig, grubby, and distro-specific bootloader paths.
  • Cross-distro advice that says only update-grub is the answer is Debian-brained provincialism.

Stage 3: Kernel Initialization

The compressed kernel image decompresses and then:
  1. sets up CPU mode and early memory management
  2. builds page tables
  3. initializes interrupt handling
  4. probes buses and devices
  5. initializes built-in drivers
  6. mounts the initramfs as the temporary early root

dmesg | head -50
dmesg -T | grep -iE 'error|fail|oom|nvme|xfs|ext4'

Stage 4: Initramfs - the bridge to the real root

The kernel still needs enough tooling to find the real root filesystem. That might require storage drivers, RAID assembly, LUKS unlock, LVM activation, or network boot logic.

initramfs in RAM
├── /init
├── busybox or dracut tools
├── kernel modules
└── scripts to find and mount the real root

Failure here usually looks like: cannot find root device, dropped to emergency shell, or plain panic.

Common reasons:
  • wrong UUID on kernel command line
  • missing storage driver
  • broken RAID/LVM/LUKS setup
  • stale initramfs after controller or kernel changes

Rebuild commands vary:

# Debian / Ubuntu
update-initramfs -u

# RHEL / Fedora / Rocky / Alma
dracut --force

Stage 5: PID 1 takes over

The kernel executes the configured init binary, almost always systemd now.

PID 1 is special:
  • it is the ultimate parent of orphaned processes
  • if it exits, the kernel panics
  • signal semantics around PID 1 are special

systemd-analyze blame | head -20
systemd-analyze critical-chain

Part 2: The Kernel

What Linux actually is

Linux is the kernel, not the whole operating system.

users
shells and apps
libraries
syscalls
kernel
hardware

Everything from bash to nginx to systemd is userspace. The kernel mediates access to CPU, memory, filesystems, devices, and networking.

Key kernel concepts

Syscalls are the contract boundary.
  • file I/O: open, read, write, close
  • process control: fork, execve, wait4
  • networking: socket, connect, accept
  • memory: mmap, mprotect, brk

Modules are loadable kernel components.

lsmod
modinfo ext4
modprobe br_netfilter

Kernel logs are where hardware truth often leaks out.

dmesg -T | tail -50
journalctl -k -b

sysctl exposes runtime kernel tuning.

sysctl net.ipv4.ip_forward
sysctl -w net.ipv4.ip_forward=1
sysctl --system

Practical rule: do not cargo-cult random sysctl snippets from the internet. Many of them are fossils from 2012, and some break container, VPN, or routing behavior.


Part 3: systemd

systemd is the init system and service manager on most modern Linux distributions. It replaced linear shell-script boot with dependency-aware service management, supervision, logging integration, resource control, timers, sockets, and more.

Essential commands

systemctl status nginx
systemctl start nginx
systemctl stop nginx
systemctl restart nginx
systemctl reload nginx
systemctl enable nginx
systemctl disable nginx
systemctl enable --now nginx
systemctl list-units --failed
systemctl list-timers
systemctl daemon-reload

Units that matter most

Unit type | Purpose
service | long-running daemon or one-shot task
socket | socket activation
timer | scheduled task
mount / automount | filesystem mounts
target | grouping / boot milestone
path | trigger on file path events
slice | cgroup-based resource grouping
scope | externally created process group

A sane service file

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=postgresql.service
Wants=postgresql.service

# Only add these if the app is a *client* that truly requires working network before start
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/server --port 8080 --config /etc/myapp/config.yaml
Restart=on-failure
RestartSec=5
Environment=APP_ENV=production
MemoryMax=512M
CPUQuota=200%
LimitNOFILE=65536
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/lib/myapp /var/log/myapp

[Install]
WantedBy=multi-user.target

Dependency semantics that trip people

Directive | Meaning
After= | ordering only
Before= | ordering only
Wants= | soft dependency
Requires= | hard dependency
BindsTo= | hard dependency with stronger lifecycle coupling
PartOf= | propagate restart/stop actions

Big trap: network.target is not “the network is ready.” It mostly means networking stack startup has happened. Use network-online.target only for client software that actually must wait for configured connectivity. Most server daemons do not need it.

Drop-in overrides

Prefer overrides instead of editing packaged unit files.

systemctl edit nginx
systemctl cat nginx
systemctl show nginx -p FragmentPath -p DropInPaths

Example:

# /etc/systemd/system/nginx.service.d/override.conf
[Service]
MemoryMax=1G
LimitNOFILE=65536

Timers

Timers replace a lot of old cron use cases and integrate with service management.

# /etc/systemd/system/backup.timer
[Unit]
Description=Nightly backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=5m

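[Install]
WantedBy=timers.target

Enable the timer, not the service. By default a timer activates the service unit with the same name, here backup.service.

systemctl enable --now backup.timer
systemctl list-timers --all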

journald

journalctl -u nginx -f
journalctl -u nginx --since '1 hour ago'
journalctl -p err -b
journalctl -k
journalctl --disk-usage
journalctl --vacuum-size=500M
journalctl -o json-pretty -u nginx -n 1

Useful recovery targets

systemctl get-default
systemctl isolate rescue.target
systemctl isolate emergency.target

rescue.target tries to give you a usable single-user environment. emergency.target is even more minimal and rude.


Part 4: Processes and Signals

Process lifecycle

fork() -> child process created
execve() -> process image replaced with new program
exit() -> process terminates and becomes a zombie until reaped
wait() / waitpid() -> parent collects exit state

Every process has:
  • PID and PPID
  • credentials: UID, GID, groups
  • open file descriptors
  • memory mappings
  • cgroup membership
  • namespaces

ps aux
ps -eo pid,ppid,stat,%cpu,%mem,cmd --sort=-%cpu | head
pstree -p

Process states

State | Meaning
R | runnable or running
S | interruptible sleep
D | uninterruptible sleep, often I/O wait
T | stopped
Z | zombie

Zombies use almost no memory but they still consume PID table entries. Enough of them and fork() starts failing.

ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/'
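
Because a zombie only disappears when its parent reaps it, the useful question is usually which parent is leaking them. A small sketch that groups zombies by parent PID; zombies_by_parent is an illustrative name, and the input format is an assumption stated in the comment.

```shell
# Count zombie processes per parent PID, busiest parent first.
# Expects "PPID STAT" pairs on stdin, for example from: ps -eo ppid=,stat=
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ }
         END { for (p in n) print n[p], p }' | sort -rn
}

# Usage on a live host:
#   ps -eo ppid=,stat= | zombies_by_parent
```

Each output line is "count parent-pid". The parent at the top of the list is the process to fix or restart.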

Signals

Signal | Purpose
SIGHUP | reload by convention
SIGINT | interactive interrupt
SIGQUIT | quit + core by default
SIGTERM | graceful termination
SIGKILL | uncatchable kill
SIGSTOP | uncatchable stop
SIGCONT | continue
SIGCHLD | child state changed

kill PID
kill -TERM PID
kill -HUP PID
kill -0 PID
kill -9 PID
pkill -f 'python app'

Operator rule:
  1. inspect first
  2. SIGTERM second
  3. SIGKILL only when grace failed or the thing is obviously wedged


Part 5: Permissions

The base permission model

-rwxr-xr-- 1 deploy www-data 4096 Mar 23 14:00 app.py

  • file: r read, w write, x execute
  • directory: r list names, w create/delete entries, x traverse

chmod 755 file
chmod 644 file
chmod u+x file
chown user:group file

Special bits

Bit | Meaning
SUID | execute as file owner
SGID | execute as file group or inherit group on directory
sticky | only owner can delete entries in directory

find / -perm -4000 -ls 2>/dev/null
chmod g+s /srv/shared
chmod +t /tmp

umask

umask

Common values:
  • 0022 -> files 644, dirs 755
  • 0002 -> files 664, dirs 775
  • 0077 -> private by default
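
The values above follow from one rule: the resulting mode is the requested mode with the umask bits cleared. Programs typically request 666 for new files and 777 for new directories. A minimal sketch of that arithmetic; mode_after_umask is an illustrative name.

```shell
# Resulting mode = requested mode AND NOT umask.
# Both arguments are octal strings without a leading zero.
mode_after_umask() {
    printf '%03o\n' "$(( 0$1 & ~0$2 ))"
}

mode_after_umask 666 022   # new regular file under umask 0022, prints 644
mode_after_umask 777 027   # new directory under umask 0027, prints 750
```

The umask only removes bits. It never adds execute permission to a file that was created without it.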

ACLs - when rwx is too blunt

Traditional mode bits are coarse. ACLs add per-user and per-group entries.

getfacl file
setfacl -m u:alice:r file
setfacl -m g:ops:rwX /srv/app
setfacl -d -m g:ops:rwX /srv/app

Use ACLs for shared directories and controlled exceptions. Do not turn them into a haunted forest of invisible permissions nobody remembers.

sudo and visudo

Do not hand-edit /etc/sudoers like a maniac with a flamethrower.

visudo
visudo -c

Prefer small files in /etc/sudoers.d/.

Example:

%wheel ALL=(ALL:ALL) ALL
ops ALL=(root) NOPASSWD: /usr/bin/systemctl restart nginx

Linux capabilities

Root used to mean almost all power. Capabilities split that power into smaller pieces.

Examples:
  • CAP_NET_BIND_SERVICE - bind to ports below 1024
  • CAP_NET_ADMIN - network administration operations
  • CAP_SYS_TIME - set system clock
  • CAP_SYS_ADMIN - the kitchen-sink monster; avoid when possible

Inspect and set file capabilities:

getcap /path/to/binary
setcap cap_net_bind_service=+ep /usr/local/bin/myweb

Capabilities are great for least privilege. They are also a good way to create weird bugs if you do not understand effective, permitted, inheritable, and ambient sets.

MAC - SELinux and AppArmor

DAC says what the file owner and mode bits allow. MAC says what policy allows, regardless of owner intent.

  • SELinux is label-based and powerful.
  • AppArmor is path-based and usually easier to approach.

Quick checks:

# SELinux
getenforce
restorecon -Rv /var/www
ausearch -m avc -ts recent

# AppArmor
aa-status
apparmor_status

If a service gets EACCES but mode bits look fine, think MAC.


Part 6: The Filesystem

Everything is a file-ish thing

Regular files, directories, symlinks, block devices, character devices, sockets, pipes, procfs, sysfs - Linux represents a lot of system state through file-like interfaces.

Important paths

Path | Purpose
/ | root
/etc | configuration
/var | variable data
/home | user homes
/root | root home
/tmp | temporary files
/run | runtime state, often tmpfs
/proc | process and kernel state
/sys | device and driver state
/dev | device nodes
/boot | kernel and bootloader assets
/opt | optional third-party software
/srv | site/service data

Inodes

An inode stores metadata: ownership, mode, timestamps, size, block pointers, and more. Filenames live in directory entries, not inodes.

ls -i file
stat file
df -i

When df -h says there is space but writes still fail, check:
  • df -i for inode exhaustion
  • read-only remounts
  • quotas
  • deleted-open-file leaks
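
The inode check is easy to automate. A sketch that flags filesystems at or above a usage threshold, assuming the POSIX-style column layout of df -Pi and mount points without spaces; inode_hogs is an illustrative name.

```shell
# Print "mountpoint usage%" for filesystems whose inode usage meets the limit.
# Expects `df -Pi` output on stdin; the limit defaults to 90 percent.
inode_hogs() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        # filesystems without inode accounting report "-" and are skipped
        if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6, pct "%"
    }'
}

# Usage on a live host:
#   df -Pi | inode_hogs 90
```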

ln file hardlink
ln -s file symlink

  • hard link -> same inode, same filesystem only
  • symlink -> path reference, can cross filesystems, can dangle
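
The difference is visible in the inode numbers. A small sketch that compares them; same_inode is an illustrative helper, and the comparison is only meaningful for paths on the same filesystem, since inode numbers are unique per filesystem.

```shell
# Succeeds when both paths are directory entries for the same inode.
# ls -d keeps a symlink from being followed, so the link's own inode is reported.
same_inode() {
    a=$(ls -di -- "$1" | awk '{print $1}')
    b=$(ls -di -- "$2" | awk '{print $1}')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage in a scratch directory:
#   echo data > file; ln file hardlink; ln -s file symlink
#   same_inode file hardlink && echo "same inode"
#   same_inode file symlink  || echo "different inode"
```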

VFS

The Virtual Filesystem layer lets the same syscalls work across ext4, XFS, tmpfs, NFS, overlayfs, procfs, and friends.

Common filesystem types

Filesystem | Best use
ext4 | sane general-purpose default
XFS | big filesystems, high throughput, default on many RHEL systems
Btrfs | snapshots, checksums, compression, advanced features
tmpfs | RAM-backed temporary data
overlayfs | container layers
NFS | network file sharing

Part 7: Storage

Block devices and partitions

lsblk
lsblk -f
blkid
fdisk -l
parted -l
findmnt
cat /etc/fstab

LVM - storage virtualization that matters

physical disks -> physical volumes -> volume groups -> logical volumes -> filesystems

pvcreate /dev/sdb1
vgcreate data /dev/sdb1
lvcreate -L 50G -n app data
mkfs.ext4 /dev/data/app
mount /dev/data/app /srv/app

Growth example:

lvextend -L +20G /dev/data/app
resize2fs /dev/data/app           # ext4
xfs_growfs /srv/app               # XFS uses mountpoint

Resize caveats worth tattooing on your frontal lobe

  • ext4 can usually grow online; shrinking requires the filesystem to be unmounted.
  • XFS growth is easy; shrinking is generally not a normal operation to rely on.
  • Always understand the full stack: partition/LV size, then filesystem size, not just one layer.
  • Backups first. Heroic confidence after coffee is not a backup strategy.

RAID levels

RAID | Use
RAID 0 | speed, zero redundancy
RAID 1 | mirror
RAID 5 | one-disk parity, rebuild risk rises with size
RAID 6 | two-disk parity
RAID 10 | mirror + stripe, great practical default for important write-heavy workloads

Software RAID basics

cat /proc/mdstat
mdadm --detail /dev/md0

When an array is degraded:
  • performance often drops
  • risk during rebuild rises
  • do not celebrate because it is “still up”
  • watch SMART data and rebuild progress

Disk health

smartctl -a /dev/sda
smartctl -t short /dev/sda
iostat -xz 1 5
iotop

Mount options that matter

Option | Use
noexec | block direct binary execution
nosuid | ignore SUID/SGID
nodev | ignore device nodes
ro | read-only
noatime | reduce access-time writes in some cases

LUKS - disk encryption basics

LUKS is the common Linux standard for block-device encryption.

cryptsetup luksFormat /dev/sdb1
cryptsetup open /dev/sdb1 secure_data
mkfs.ext4 /dev/mapper/secure_data

Files involved:
  • /etc/crypttab - what to unlock at boot
  • initramfs - often required for encrypted root

Backup the LUKS header when appropriate. Lose it and your encrypted data may become modern art.


Part 8: Memory Management

Big picture

Linux tries to use RAM aggressively. File cache is good. Empty RAM is mostly wasted opportunity.

free -h
head -30 /proc/meminfo

The field that usually matters most is MemAvailable, not MemFree.
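
For dashboards or quick checks it helps to express MemAvailable as a share of MemTotal. A minimal sketch that parses meminfo-style input; mem_available_pct is an illustrative name, and any alert threshold you attach to it is your own assumption.

```shell
# Print MemAvailable as a percentage of MemTotal, one decimal place.
# Expects /proc/meminfo-formatted lines on stdin.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END {
             if (total > 0) printf "%.1f\n", 100 * avail / total
             else exit 1
         }'
}

# Usage on a live host:
#   mem_available_pct < /proc/meminfo
```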

Memory types

Type | Meaning
anonymous | heap, stack, private mappings
page cache | cached file data
slab | kernel object caches
shared/tmpfs | shared pages
kernel memory | kernel code and data

Virtual memory

Each process sees a virtual address space. The kernel maps that to physical memory. This gives isolation, lazy allocation, copy-on-write, and mmap-backed files.

cat /proc/PID/maps
cat /proc/PID/smaps_rollup
pmap PID

Swap

swapon --show
cat /proc/swaps
sysctl vm.swappiness

Swap is not evil. Blindly disabling swap everywhere is meme-ops. But sustained swapping means pressure exists and you should understand why.

OOM killer

dmesg -T | grep -i 'oom\|killed process'
journalctl -k -g 'oom|killed process'    # -g takes a PCRE pattern; all-lowercase patterns match case-insensitively
cat /proc/PID/oom_score
cat /proc/PID/oom_score_adj

Useful idea:
  • if the kernel is killing things, the argument is already over
  • now you are doing forensics, not philosophy

Memory triage

free -h
vmstat 1 5
ps aux --sort=-%mem | head -15
slabtop

Part 9: Networking Fundamentals

Interfaces and addresses

ip addr show
ip -br addr
ip link show
ip route show
ip neigh show

Prefer ip over old ifconfig and route. The legacy commands still exist in many places, but the iproute2 tools are the modern interface.

DNS, NSS, and why dig is not the whole truth

There are multiple layers here:
  • /etc/hosts
  • /etc/nsswitch.conf
  • libc resolver behavior
  • systemd-resolved on many systems
  • /etc/resolv.conf
  • upstream DNS servers

So:
  • dig example.com asks DNS directly.
  • getent hosts example.com asks the system resolver path configured by NSS.
  • Those are not the same test.

getent hosts example.com
dig example.com +short
resolvectl status
resolvectl query example.com
cat /etc/nsswitch.conf
ls -l /etc/resolv.conf

If a host resolves with dig but not with getent, the problem may be NSS, search domains, systemd-resolved, or /etc/hosts, not raw DNS reachability.
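
When the two tests disagree, the first thing to read is the hosts line in /etc/nsswitch.conf, because it defines the order getent follows. A sketch that prints that order one source per line; nss_hosts_order is an illustrative name.

```shell
# Print the name-lookup sources for hosts, in the order NSS consults them.
# Expects nsswitch.conf content on stdin.
nss_hosts_order() {
    awk '$1 == "hosts:" {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^#/) break        # stop at a trailing comment
            if ($i ~ /^\[/) continue    # skip action items such as [NOTFOUND=return]
            print $i
        }
    }'
}

# Usage on a live host:
#   nss_hosts_order < /etc/nsswitch.conf
```

If files comes before dns, an entry in /etc/hosts wins over DNS for getent, while dig never looks at it.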

/etc/resolv.conf realities

On systems using systemd-resolved, /etc/resolv.conf may be:
  • a symlink to the stub resolver config using 127.0.0.53
  • a symlink to a generated file listing upstream resolvers
  • a static file managed by something else

Do not assume it is a normal hand-edited file anymore.
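
A quick way to tell which case applies is to resolve the symlink and compare the target. A sketch, assuming the paths systemd-resolved conventionally uses under /run/systemd/resolve and a readlink that supports -f; resolv_conf_mode is an illustrative name.

```shell
# Classify a resolv.conf path by where it finally points.
# Defaults to /etc/resolv.conf when no argument is given.
resolv_conf_mode() {
    target=$(readlink -f -- "${1:-/etc/resolv.conf}") || return 1
    case "$target" in
        /run/systemd/resolve/stub-resolv.conf) echo "resolved-stub" ;;
        /run/systemd/resolve/resolv.conf)      echo "resolved-upstream" ;;
        *)                                     echo "other: $target" ;;
    esac
}

# Usage on a live host:
#   resolv_conf_mode
```

Anything reported as other is either a static file or managed by a different tool, so check what owns it before editing.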

Connectivity tests

ping host
tracepath host
traceroute host
nc -zv host 443
curl -v telnet://host:443
ss -tlnp
tcpdump -i eth0 port 443

TCP states worth knowing

State | Meaning | Common interpretation
LISTEN | waiting for inbound connections | normal for servers
ESTAB | connection active | normal
TIME-WAIT | recently closed | many short-lived connections
CLOSE-WAIT | peer closed, local side has not | application bug or leak
SYN-SENT | outbound connect in progress | upstream unreachable or filtered

Many CLOSE-WAIT sockets usually mean your application is failing to close descriptors after the peer has gone away.
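
A per-state count makes a CLOSE-WAIT pile-up obvious at a glance. A minimal sketch that summarizes ss -tan output, assuming the state is the first column and the first line is a header; tcp_state_counts is an illustrative name.

```shell
# Count TCP sockets per state, most common state first.
# Expects `ss -tan` output on stdin.
tcp_state_counts() {
    awk 'NR > 1 { n[$1]++ }
         END { for (s in n) print n[s], s }' | sort -k1,1nr -k2
}

# Usage on a live host:
#   ss -tan | tcp_state_counts
```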

Bridges, bonds, VLANs - the one-screen version

  • bridge - software L2 switch joining interfaces into one broadcast domain
  • bond/team - combine multiple NICs for redundancy or aggregated bandwidth
  • VLAN - isolate traffic at layer 2 using tagged networks

Quick examples:

bridge link
bridge vlan show
ip -d link show
cat /proc/net/bonding/bond0

If you work around virtualization, hypervisors, KVM, Proxmox, libvirt, or container hosts, bridges and VLANs stop being “advanced” and become Tuesday.

Policy routing and multiple tables

Sometimes the right route depends on source IP, mark, or interface. That is policy routing, not basic destination lookup.

ip rule show
ip route show table main
ip route show table all

If VPN, multihoming, or weird asymmetric paths are involved, look here.


Part 10: Firewalls

Linux firewalling today is nftables-first conceptually, even when older tools are still in circulation.

nftables mental model

  • tables hold chains
  • chains hold rules
  • rules match packets and take actions
  • one ruleset can cover IPv4 and IPv6 cleanly

Inspect the live ruleset first:

nft list ruleset

Example config:

table inet filter {
  chain input {
    type filter hook input priority 0;
    policy drop;

    ct state established,related accept
    iif lo accept
    tcp dport { 22, 80, 443 } accept
    # meta l4proto matches the final protocol, so ICMPv6 behind extension headers is still caught
    meta l4proto icmp accept
    meta l4proto ipv6-icmp accept
  }
}

Apply safely: check the file for errors first, then load it.

nft -c -f /etc/nftables.conf
nft -f /etc/nftables.conf

iptables still matters

You will still see iptables because:
  • old docs never die
  • Docker, kube-proxy, fail2ban, and assorted tools still expose iptables-shaped behavior
  • many distributions ship iptables compatibility frontends backed by nftables underneath

Useful commands:

iptables -L -n -v --line-numbers
iptables -t nat -L -n -v
iptables-save

Prefer conntrack syntax over the older state match when writing new iptables rules:

iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

Firewalld and UFW

  • firewalld is common on RHEL-family systems
  • UFW is common on Ubuntu
  • both are frontends, not the kernel firewall engine itself

firewall-cmd --state
firewall-cmd --get-active-zones
ufw status verbose

Rule ordering still matters

Whether nftables or iptables, careless DROP rules can lock you out. Keep your current connection safe before you get clever.


Part 11: SSH

What happens on connect

  1. TCP connect to port 22
  2. key exchange
  3. host key verification
  4. user authentication
  5. channel/session setup

Host trust hygiene

SSH uses TOFU - trust on first use - unless you pre-seed trust another way.

Important files:
  • ~/.ssh/known_hosts
  • ~/.ssh/config
  • ~/.ssh/id_ed25519

Key types

Type | Advice
Ed25519 | preferred general default
RSA | legacy compatibility
ECDSA | acceptable
DSA | dead, leave it buried

ssh-keygen -t ed25519 -C 'deploy@example'
ssh-copy-id user@host

Config file

Host prod-*
    User deploy
    IdentityFile ~/.ssh/deploy_ed25519
    ProxyJump bastion.example.com

Host db-primary
    HostName 10.0.2.50
    User postgres
    Port 2222

Tunnels

ssh -L 8080:localhost:80 user@remote
ssh -R 8080:localhost:3000 user@remote
ssh -D 1080 user@remote
ssh -J bastion internal-host

Agent forwarding

ssh -A bastion

Use sparingly. Root on the intermediate host can potentially abuse your forwarded agent. ProxyJump is often the cleaner answer.


Part 12: The /proc Filesystem

/proc is a virtual filesystem exposing kernel and process state.

Per-process inspection

cat /proc/$$/cmdline | tr '\0' ' '
ls -la /proc/$$/cwd
cat /proc/$$/environ | tr '\0' '\n' | head
cat /proc/$$/status
ls -la /proc/$$/fd
cat /proc/$$/maps

Secrets warning: environment variables are not a magical safe. Same-user or root access can often inspect them.

System-wide files

cat /proc/meminfo
cat /proc/cpuinfo
cat /proc/loadavg
cat /proc/uptime
cat /proc/sys/kernel/pid_max
cat /proc/net/tcp

Deleted open files

lsof +L1
find /proc/*/fd -ls 2>/dev/null | grep deleted

If a file is deleted but still open, disk space is not reclaimed until the process closes it or dies.


Part 13: Debugging with strace

strace shows syscalls. That is often enough to expose what a process is actually waiting on.

strace -p 12345
strace -p 12345 -t -T
strace -f ./deploy.sh
strace -e trace=file ./myapp
strace -e trace=network ./myapp

Patterns

Stuck process

strace -p PID
# blocked on read, connect, poll, futex, openat, etc.

Slow startup

strace -T -e trace=file,network ./myapp 2>&1 | sort -t'<' -k2 -rn | head

Permission denied

strace -e trace=file ./myapp 2>&1 | grep EACCES

strace is not subtle, but subtle is overrated at 3 AM.


Part 14: Performance Triage

USE method

For each resource, check:
  • Utilization
  • Saturation
  • Errors

Resource | Utilization | Saturation | Errors
CPU | top, mpstat | run queue | kernel or hardware complaints
Memory | free -h, vmstat | swap, reclaim, OOM | OOM logs
Disk | iostat -xz | await, queue depth | I/O errors
Network | sar -n DEV, ip -s link | drops, backlog, retransmits | driver/link errors

Quick triage chain

uptime
free -h
df -h
df -i
dmesg -T | tail -30
iostat -xz 1 3
ss -s
ps -eo pid,ppid,%cpu,%mem,stat,cmd --sort=-%cpu | head
ps -eo pid,ppid,%cpu,%mem,stat,cmd --sort=-%mem | head

Load average

Load is runnable tasks plus tasks stuck in uninterruptible sleep, usually I/O.

uptime
nproc

High load with low CPU usage often means I/O pain, not CPU pain.
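
Load only means something relative to the number of CPUs. A small sketch that normalizes the 1-minute figure; load_per_cpu is an illustrative name, and what ratio counts as "too high" is a judgment call rather than a fixed rule.

```shell
# Divide the 1-minute load average by the CPU count.
# $1 = a /proc/loadavg-style line, $2 = number of CPUs.
load_per_cpu() {
    awk -v line="$1" -v cpus="$2" 'BEGIN {
        if (cpus < 1) exit 1
        split(line, f, " ")
        printf "%.2f\n", f[1] / cpus
    }'
}

# Usage on a live host:
#   load_per_cpu "$(cat /proc/loadavg)" "$(nproc)"
```

A result well above 1.00 while CPU utilization stays low points at tasks in uninterruptible sleep, which is the I/O case described above.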


Part 15: Logging

Common places

/var/log/syslog or /var/log/messages
/var/log/auth.log or /var/log/secure
/var/log/kern.log
application logs under /var/log/<app>/

journald essentials

journalctl -u nginx -f
journalctl -u nginx --since '1 hour ago'
journalctl -b -p err
journalctl -k
journalctl --disk-usage
journalctl --vacuum-size=500M

logrotate

Example:

/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        systemctl reload myapp
    endscript
}

Important modern nuance:
  • on some systems log rotation is driven by cron
  • on others it is driven by a systemd timer such as logrotate.timer
  • do not assume cron is the scheduler without checking

systemctl status logrotate.timer
systemctl list-timers | grep logrotate

Part 16: Package Management

Debian / Ubuntu

apt update
apt upgrade
apt install nginx
apt remove nginx
apt purge nginx
apt search nginx
apt-cache policy nginx
dpkg -l | grep nginx
dpkg -L nginx

RHEL / Fedora / Rocky / Alma

dnf install nginx
dnf upgrade
dnf info nginx
dnf remove nginx
dnf list installed | grep nginx
rpm -qa | grep nginx
rpm -ql nginx

Package hygiene

  • prefer vendor packages or known repositories over random curl-pipe installers
  • understand what created a file before you edit or delete it
  • config drift and package ownership matter

Part 17: Text Processing

Pipeline mindset

grep ' 500 ' access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head

Core tools

grep -r 'TODO' src/
grep -E 'error|warn' file
awk '{print $1}' file
awk -F: '{print $1,$7}' /etc/passwd
sed -n '10,20p' file
sed 's/old/new/g' file
sort -rn
uniq -c
cut -d: -f1 /etc/passwd
tr 'a-z' 'A-Z'
head -20 file
tail -f file
tee file
xargs

One nitpick worth keeping: avoid useless cat file | grep pattern when grep pattern file does the job. It is not a moral issue, just cleaner.


Part 18: cgroups and Namespaces

cgroup v2

Modern Linux increasingly means cgroup v2: a single unified hierarchy.

Useful checks:

mount | grep cgroup
stat -fc %T /sys/fs/cgroup
cat /proc/self/cgroup
systemd-cgls
systemd-cgtop

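Common files on cgroup v2 systems:

cat /sys/fs/cgroup/cgroup.controllers
# memory.current, memory.max, and cpu.max exist only on non-root cgroups, so read them from a child such as system.slice
cat /sys/fs/cgroup/system.slice/memory.current
cat /sys/fs/cgroup/system.slice/memory.max
cat /sys/fs/cgroup/system.slice/cpu.max      # present only where the cpu controller is enabled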

Per-service example:

systemctl show nginx -p ControlGroup
CG=$(systemctl show nginx -p ControlGroup --value)
cat /sys/fs/cgroup${CG}/memory.current
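
The cpu.max file in the same per-service directory holds a quota and a period in microseconds, or the word max when no limit is set. A sketch that converts it into a CPU count; cpu_max_to_cpus is an illustrative name.

```shell
# Convert a cgroup v2 cpu.max line ("QUOTA PERIOD" or "max PERIOD")
# into the number of CPUs the group may use.
cpu_max_to_cpus() {
    awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f\n", $1 / $2
    }'
}

# Usage on a live host, reusing CG from the example above:
#   cpu_max_to_cpus < "/sys/fs/cgroup${CG}/cpu.max"
```

A unit with CPUQuota=200% and the default 100 ms period shows up as 200000 100000, which this prints as 2.00.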

Namespaces

Namespace | Isolates
PID | process IDs
net | network stack
mount | mount table
UTS | hostname
user | UID/GID mappings
IPC | shared IPC objects
cgroup | cgroup view
time | time namespaces on supported systems

ls -la /proc/PID/ns
ip netns add test
ip netns exec test ip addr

Containers are mostly cgroups + namespaces + filesystem layering + runtime tooling.


Part 19: Security Hardening

SSH daemon baseline

PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy admin
MaxAuthTries 3

Least privilege

  • run services as dedicated users
  • use capabilities when a narrow privilege is enough
  • use sudoers.d instead of handing out full root casually
  • restrict writable paths in systemd units

MAC controls

  • SELinux: powerful label-based enforcement
  • AppArmor: simpler path-based confinement on many Ubuntu systems

Firewall baseline

  • default deny inbound unless host role says otherwise
  • allow only required services
  • document exceptions
  • beware container/orchestrator interaction with host firewall rules

Kernel and sysctl hardening

Example baseline ideas:

net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.tcp_syncookies = 1
kernel.dmesg_restrict = 1
fs.protected_hardlinks = 1
fs.protected_symlinks = 1

Patching and provenance

  • keep the OS current
  • know which repos you trust
  • verify what owns a binary and where it came from
  • avoid mystery curl scripts unless you have reviewed them

Auditing

auditctl -w /etc/passwd -p wa -k passwd_changes
ausearch -k passwd_changes
last
lastb

Part 20: eBPF

eBPF lets you run verified sandboxed programs in the kernel for observability, networking, and security uses.

Examples:

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
execsnoop
opensnoop
biolatency
tcpconnect

It is absurdly powerful. It is also not beginner-friendly when you leave the one-liner lane.


Part 21: Linux Distributions

Family | Examples | Package tools | Common defaults
Debian | Debian, Ubuntu | apt, dpkg | ext4, AppArmor often on Ubuntu
Red Hat | RHEL, Rocky, Alma, Fedora | dnf, rpm | XFS common, SELinux strong
SUSE | SLES, openSUSE | zypper, rpm | Btrfs common on root
Arch | Arch, Endeavour | pacman | rolling release
Alpine | Alpine | apk | musl, small footprint

Core Linux skills transfer. Packaging, defaults, release model, and support policies are where the families diverge.


Part 22: On-Call Survival Guide

Disk full

df -h
df -i
du -xhd1 /var | sort -h
lsof +L1
journalctl --disk-usage

OOM

dmesg -T | grep -i 'oom\|killed process'
free -h
ps aux --sort=-%mem | head -15

Service failed

systemctl status SERVICE
journalctl -u SERVICE -n 100 --no-pager
ss -tlnp | grep PORT
systemctl cat SERVICE

High load

uptime
nproc
iostat -xz 1 3
vmstat 1 5
top -bn1 | head -30

Safe vs dangerous

Usually safe | Usually dangerous
read logs and status | kill -9 on business-critical daemons
inspect sockets, pids, mounts | deleting unknown files under pressure
collect evidence | rebooting before you know what happened
journal vacuum with intent | docker system prune in anger

Part 23: Real-World Case Studies

Case 1: OOM kills the app

Symptom: app dies, app logs say little.

Investigation: dmesg shows the kernel killed it. Heap or process memory budget assumed the host belonged entirely to one process.

Fix: reduce heap, add memory limits, add monitoring, leave headroom for kernel and cache.

Case 2: Disk “full” but df looks okay

Symptom: app still cannot write.

Investigation: df -i shows inode exhaustion or lsof +L1 shows giant deleted-open logs.

Fix: clean tiny-file storm or restart/rotate the offending process correctly.

Case 3: Zombie army

Symptom: fork() fails with EAGAIN.

Investigation: parent process is not reaping children. Zombies pile up.

Fix: fix the parent, restart it, or kill it so PID 1 adopts and reaps the zombies.

Case 4: Service flapping under systemd

Symptom: service restarts every few seconds and hits start limit.

Investigation: journalctl -u reveals bad config path, bad permissions, or missing dependency.

Fix: use absolute paths, correct WorkingDirectory, fix config, then systemctl reset-failed SERVICE.

Case 5: Logs eat root

Symptom: SSH slow, commands fail, temp files cannot be created.

Investigation: giant logs, failed rotation, or runaway debug mode.

Fix: truncate carefully if the file is open, repair rotation scheduling, consider separate /var.
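
Truncating in place frees the blocks while keeping the same inode, so the process that holds the log open keeps a valid descriptor. A minimal sketch; truncate_in_place is an illustrative name. One caveat: a writer that did not open the file in append mode keeps its old offset, so its next write can leave a sparse hole at the start of the file.

```shell
# Empty a file without unlinking it.
# Unlike rm, this releases disk space even while another process has the file open.
truncate_in_place() {
    : > "$1"
}

# Usage on a live host (path taken from the logrotate example, adjust as needed):
#   truncate_in_place /var/log/myapp/app.log
```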

Case 6: High load, low CPU

Symptom: load average huge, CPUs not pegged.

Investigation: iostat shows long await; tasks are stuck in I/O wait.

Fix: storage bottleneck, not CPU bottleneck. Different war, different tools.


Glossary

Term | Meaning
kernel | core of the operating system
syscall | userspace entry into kernel services
PID 1 | init process, usually systemd
inode | file metadata record
file descriptor | numeric handle for open file/socket/pipe
page cache | RAM used for file caching
OOM killer | kernel logic that kills tasks under extreme memory exhaustion
cgroup | resource control grouping
namespace | isolation boundary
unit | systemd-managed object
target | systemd grouping / boot milestone
initramfs | temporary early root in RAM
ESP | EFI System Partition
LVM | Logical Volume Manager
LUKS | Linux block-device encryption standard
ACL | access control list
capability | fine-grained kernel privilege
MAC | mandatory access control
TOFU | trust on first use
NSS | Name Service Switch

Flashcards

Boot and kernel

| Q | A |
| --- | --- |
| Boot chain in order? | firmware -> bootloader -> kernel -> initramfs -> PID 1 |
| What is initramfs for? | early userspace needed to reach the real root filesystem |
| What happens if PID 1 exits? | kernel panic |
| network.target vs network-online.target? | startup marker vs actual wait-for-network target |

Processes and permissions

| Q | A |
| --- | --- |
| SIGTERM vs SIGKILL? | graceful request vs uncatchable kill |
| What is a zombie? | exited process not yet reaped |
| Why use ACLs? | mode bits are too coarse for some sharing needs |
| Why use capabilities? | narrow privileges instead of full root |

Storage and memory

| Q | A |
| --- | --- |
| ext4 shrink? | possible, but only offline (filesystem unmounted) |
| XFS shrink? | not something to plan your life around |
| MemFree or MemAvailable? | MemAvailable |
| What does lsof +L1 find? | deleted files still open |

Networking and security

| Q | A |
| --- | --- |
| dig vs getent hosts? | raw DNS query vs system resolver/NSS path |
| Many CLOSE-WAIT sockets mean? | app is not closing connections |
| nftables or iptables first? | nftables first for modern mental model |
| What allows binding privileged ports without full root? | CAP_NET_BIND_SERVICE |

Drills

Drill 1: Read the local boot and init path

cat /proc/cmdline
ps -p 1 -o pid,comm,args
systemd-analyze
systemd-analyze critical-chain

Drill 2: Inspect a running process deeply

PID=$(pgrep -n sshd)
cat /proc/$PID/status
ls -la /proc/$PID/fd | head
cat /proc/$PID/cgroup

Drill 3: Compare DNS tools

dig example.com +short
getent hosts example.com
resolvectl query example.com

Explain why the answers may differ.
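One place to look when explaining the difference: `dig` sends a DNS query directly and ignores `/etc/hosts` and NSS, while `getent hosts` follows the `hosts:` line in `/etc/nsswitch.conf`. The sketch below prints that lookup order from an nsswitch.conf-style file on stdin; `nss_hosts_order` is a hypothetical helper name.

```shell
# Hypothetical helper: print the source order from the "hosts:" line of an
# nsswitch.conf-style file read on stdin. Commented-out lines are ignored
# because their first field is not exactly "hosts:".
nss_hosts_order() {
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }'
}
```

Usage would be `nss_hosts_order < /etc/nsswitch.conf`. If `files` comes before `dns`, an entry in `/etc/hosts` wins for `getent` but stays invisible to `dig`.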

Drill 4: Find deleted open files

lsof +L1

Drill 5: Add a systemd override safely

systemctl edit nginx
systemctl daemon-reload
systemctl restart nginx
systemctl cat nginx

Drill 6: Inspect cgroup v2 data for a service

CG=$(systemctl show ssh -p ControlGroup --value)
echo "$CG"
cat /sys/fs/cgroup${CG}/memory.current
cat /sys/fs/cgroup${CG}/cpu.stat

Drill 7: Check ACLs and capabilities

getfacl /srv/shared
getcap -r /usr/local/bin /usr/bin 2>/dev/null | head

Drill 8: One-minute triage drill

Collect these with no commentary first:

uptime
free -h
df -h
df -i
dmesg -T | tail -20
ss -s

Then write a three-sentence diagnosis hypothesis.
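The collection step can be wrapped so that a missing tool does not abort the run. This is a sketch, not a standard script: `run_if_present` and `triage_snapshot` are hypothetical names, and the command list simply mirrors the drill above.

```shell
# Hypothetical wrapper: run a command if it is installed, otherwise note the gap.
# Prints a "### <command line>" header before the command's output.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '### %s\n' "$*"
        "$@" 2>&1
    else
        printf '### %s (not installed, skipped)\n' "$1"
    fi
}

# Collect the one-minute triage set in a fixed order, with no commentary.
triage_snapshot() {
    run_if_present uptime
    run_if_present free -h
    run_if_present df -h
    run_if_present df -i
    run_if_present sh -c 'dmesg -T | tail -20'   # may need root on some hosts
    run_if_present ss -s
}
```

A run such as `triage_snapshot > "triage-$(date +%Y%m%dT%H%M%S).txt"` leaves a timestamped evidence file to reason from before touching anything.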


Cheat Sheet

Process and service control

ps aux
pstree -p
kill -TERM PID
kill -9 PID
systemctl status SERVICE
journalctl -u SERVICE -n 50 --no-pager

Disk and memory

df -h
df -i
du -xhd1 /var | sort -h
free -h
vmstat 1 5
lsof +L1

Network and DNS

ip -br addr
ip route
ss -tlnp
getent hosts name
dig name +short
resolvectl status

Firewall

nft list ruleset
iptables -L -n -v --line-numbers
firewall-cmd --list-all
ufw status verbose

Storage

lsblk -f
blkid
findmnt
cat /proc/mdstat
pvs && vgs && lvs

Quick triage chain

systemctl status -> journalctl -> ss -> df -h / df -i -> free -h -> dmesg -> iostat/vmstat

Self-Assessment

  • I can explain the boot chain without hand-waving.
  • I understand the difference between the kernel and userspace.
  • I know when network-online.target is appropriate and when it is not.
  • I can diagnose process states including zombies and D state tasks.
  • I understand mode bits, ACLs, sudo, and capabilities.
  • I know the practical difference between ext4 and XFS growth/shrink behavior.
  • I can investigate DNS using both dig and getent.
  • I can inspect a ruleset with nftables and still survive legacy iptables environments.
  • I can use /proc and strace to make a stuck process less mysterious.
  • I can perform a 60-second triage without immediately reaching for superstition.

Notes on Scope

This guide intentionally corrected and modernized several common Linux-teaching mistakes:

  • it treats nftables as the modern firewall model, while still covering legacy iptables
  • it treats cgroup v2 as the modern baseline
  • it distinguishes network.target from network-online.target
  • it separates DNS testing from system name-resolution testing
  • it includes ACLs, capabilities, sudo hygiene, AppArmor/SELinux, LUKS, and storage resize caveats that broad “Linux complete guides” often skip

That makes it less flashy than a “one doc explains literally everything forever” claim, but far more trustworthy.


Verification Notes

Modernized sections in this revision were checked against current upstream or vendor documentation for these areas:

  • systemd service ordering and network-online.target behavior
  • cgroup v2 unified hierarchy
  • nftables as the modern Netfilter framework and iptables compatibility layers
  • distro-specific GRUB regeneration workflows
  • Secure Boot chain details, including initrd nuance and module validation
  • systemd-resolved, resolvectl, and /etc/resolv.conf modes
  • AppArmor, SELinux, ACL, capability, and visudo behavior
  • logrotate scheduling via systemd timers on modern systems

That does not make every sentence timeless. Linux changes. But it removes the obvious stale landmines from the prior draft.