Linux - Foundations and Operations Guide¶

Scope: Modern Linux from boot to production operations - updated for systemd-era hosts, cgroup v2, nftables-era firewalling, and current distro realities.

Topics: Boot process, kernel, systemd, processes and signals, permissions, filesystems and storage, LVM, RAID, LUKS, memory, networking, DNS and NSS, nftables and iptables, SSH, /proc, strace, performance triage, logging, packages, text processing, cgroups and namespaces, hardening, eBPF, distro differences, on-call triage, drills, cheat sheet.

Level: L0-L2 (zero -> foundations -> operations)

What this guide is and is not:
- This is a practical Linux foundations and operations guide.
- It favors accurate mental models and field-useful commands over trivia and vendor marketing.
- It is broad, but it is not magic. Some areas still deserve dedicated deep dives: storage recovery, advanced networking, SELinux policy authoring, kernel internals, and performance analysis at scale.

The Mission¶

A rack server powers on in a datacenter. In under a minute it goes from dead silicon to firmware, bootloader, kernel, initramfs, PID 1, services, sockets, filesystems, and a login prompt. Later you SSH in, restart a service, inspect logs, and fix a production issue. Linux is the stack connecting all of that.

The goal here is not to turn you into a command parrot. The goal is to make the machine legible.

Table of Contents¶

The Boot Sequence
The Kernel
systemd
Processes and Signals
Users, Permissions, ACLs, and Capabilities
The Filesystem
Storage - Partitions, LVM, RAID, LUKS
Memory Management
Networking Fundamentals
Firewalls - nftables First, iptables Legacy
SSH
The /proc Filesystem
Debugging with strace
Performance Triage
Logging
Package Management
Text Processing
cgroups and Namespaces
Security Hardening
eBPF
Linux Distributions
On-Call Survival Guide
Real-World Case Studies
Glossary
Flashcards
Drills
Cheat Sheet
Self-Assessment

Part 1: The Boot Sequence¶

You press the power button. Here is the practical version of what happens.

Stage 1: Firmware - BIOS or UEFI¶

The power supply stabilizes and emits a Power Good signal. The CPU starts executing from a fixed reset vector. At that instant there is no mounted disk, no userspace, no shell, and no kernel scheduler yet.

Legacy BIOS path:

Power on -> POST -> read first sector / boot code -> jump to bootloader

Modern UEFI path:

Power on -> POST -> read NVRAM boot entries -> load EFI executable from ESP -> bootloader runs

Feature	BIOS	UEFI
Partition table	MBR	GPT
Practical disk limit	~2 TB with classic MBR	effectively enormous
Boot environment	16-bit constraints	32/64-bit firmware environment
Secure Boot	No	Yes
Bootloader location	MBR + post-MBR tricks	EFI binary on the ESP

Secure Boot, in the real world:
- Firmware validates shim against keys in firmware.
- shim validates GRUB or MokManager.
- GRUB validates and loads the signed kernel.
- The kernel enforces signature rules for loadable modules.
- Initrd/initramfs images are commonly not part of that same validation chain, so do not imagine Secure Boot as a perfectly sealed steel coffin.

Stage 2: Bootloader - usually GRUB¶

GRUB is a tiny operating system whose job is to locate the kernel, hand it a command line, and usually provide a boot menu.

cat /proc/cmdline
systemd-analyze

Useful kernel parameters:

Parameter	Purpose
`root=UUID=...`	real root filesystem
`ro`	mount root read-only first
`systemd.unit=rescue.target`	rescue target
`single` or `1`	traditional single-user shorthand
`rd.break`	break into initramfs shell
`init=/bin/bash`	bypass normal init entirely
`console=ttyS0,115200`	serial console

Do not edit generated GRUB config directly.
- Debian/Ubuntu habit: edit /etc/default/grub and files in /etc/grub.d/, then run update-grub.
- RHEL-family habit: use grub2-mkconfig, grubby, and distro-specific bootloader paths.
- Cross-distro advice that says only update-grub is the answer is Debian-brained provincialism.

Stage 3: Kernel Initialization¶

The compressed kernel image decompresses and then:
1. sets up CPU mode and early memory management
2. builds page tables
3. initializes interrupt handling
4. probes buses and devices
5. initializes built-in drivers
6. unpacks the initramfs and uses it as the temporary early root

dmesg | head -50
dmesg -T | grep -iE 'error|fail|oom|nvme|xfs|ext4'

Stage 4: Initramfs - the bridge to the real root¶

The kernel still needs enough tooling to find the real root filesystem. That might require storage drivers, RAID assembly, LUKS unlock, LVM activation, or network boot logic.

initramfs in RAM
├── /init
├── busybox or dracut tools
├── kernel modules
└── scripts to find and mount the real root

Failure here usually looks like: cannot find root device, dropped to emergency shell, or plain panic.

Common reasons:
- wrong UUID on kernel command line
- missing storage driver
- broken RAID/LVM/LUKS setup
- stale initramfs after controller or kernel changes

Rebuild commands vary:

# Debian / Ubuntu
update-initramfs -u

# RHEL / Fedora / Rocky / Alma
dracut --force
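
If you script this across a mixed fleet, one option is to key off /etc/os-release. The sketch below is illustrative only: the function name initramfs_rebuild_cmd is made up for this guide, it covers just the two families named above, and it prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print the initramfs rebuild command for a distro family.
# Argument: the ID and ID_LIKE values from /etc/os-release, space separated.
initramfs_rebuild_cmd() {
    case " $1 " in
        *debian*|*ubuntu*)
            echo "update-initramfs -u" ;;
        *rhel*|*fedora*|*centos*|*rocky*|*almalinux*)
            echo "dracut --force" ;;
        *)
            echo "unknown" ; return 1 ;;
    esac
}

# Usage: read the local os-release in a subshell and print the suggestion.
if [ -r /etc/os-release ]; then
    ( . /etc/os-release; initramfs_rebuild_cmd "${ID:-} ${ID_LIKE:-}" ) || true
fi
```

Printing instead of executing is deliberate: rebuilding the initramfs on the wrong assumption is exactly how you create the failure described above.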

Stage 5: PID 1 takes over¶

The kernel executes the configured init binary, almost always systemd now.

PID 1 is special:
- it is the ultimate parent of orphaned processes
- if it exits, the kernel panics
- signal semantics around PID 1 are special

systemd-analyze blame | head -20
systemd-analyze critical-chain

Part 2: The Kernel¶

What Linux actually is¶

Linux is the kernel, not the whole operating system.

users
shells and apps
libraries
syscalls
kernel
hardware

Everything from bash to nginx to systemd is userspace. The kernel mediates access to CPU, memory, filesystems, devices, and networking.

Key kernel concepts¶

Syscalls are the contract boundary.
- file I/O: open, read, write, close
- process control: fork, execve, wait4
- networking: socket, connect, accept
- memory: mmap, mprotect, brk

Modules are loadable kernel components.

lsmod
modinfo ext4
modprobe br_netfilter

Kernel logs are where hardware truth often leaks out.

dmesg -T | tail -50
journalctl -k -b

sysctl exposes runtime kernel tuning.

sysctl net.ipv4.ip_forward
sysctl -w net.ipv4.ip_forward=1
sysctl --system

Practical rule: do not cargo-cult random sysctl snippets from the internet. Many are fossils from 2012, and some break container, VPN, or routing behavior.

Part 3: systemd¶

systemd is the init system and service manager on most modern Linux distributions. It replaced linear shell-script boot with dependency-aware service management, supervision, logging integration, resource control, timers, sockets, and more.

Essential commands¶

systemctl status nginx
systemctl start nginx
systemctl stop nginx
systemctl restart nginx
systemctl reload nginx
systemctl enable nginx
systemctl disable nginx
systemctl enable --now nginx
systemctl list-units --failed
systemctl list-timers
systemctl daemon-reload

Units that matter most¶

Unit type	Purpose
`service`	long-running daemon or one-shot task
`socket`	socket activation
`timer`	scheduled task
`mount` / `automount`	filesystem mounts
`target`	grouping / boot milestone
`path`	trigger on file path events
`slice`	cgroup-based resource grouping
`scope`	externally created process group

A sane service file¶

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=postgresql.service
Wants=postgresql.service

# Only add these if the app is a *client* that truly requires working network before start
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/server --port 8080 --config /etc/myapp/config.yaml
Restart=on-failure
RestartSec=5
Environment=APP_ENV=production
MemoryMax=512M
CPUQuota=200%
LimitNOFILE=65536
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/lib/myapp /var/log/myapp

[Install]
WantedBy=multi-user.target

Dependency semantics that trip people¶

Directive	Meaning
`After=`	ordering only
`Before=`	ordering only
`Wants=`	soft dependency
`Requires=`	hard dependency
`BindsTo=`	hard dependency with stronger lifecycle coupling
`PartOf=`	propagate restart/stop actions

Big trap: network.target is not “the network is ready.” It mostly means networking stack startup has happened. Use network-online.target only for client software that actually must wait for configured connectivity. Most server daemons do not need it.

Drop-in overrides¶

Prefer overrides instead of editing packaged unit files.

systemctl edit nginx
systemctl cat nginx
systemctl show nginx -p FragmentPath -p DropInPaths

Example:

# /etc/systemd/system/nginx.service.d/override.conf
[Service]
MemoryMax=1G
LimitNOFILE=65536

Timers¶

Timers replace a lot of old cron use cases and integrate with service management.

# /etc/systemd/system/backup.timer
[Unit]
Description=Nightly backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=5m

[Install]
WantedBy=timers.target

systemctl enable --now backup.timer
systemctl list-timers --all

journald¶

journalctl -u nginx -f
journalctl -u nginx --since '1 hour ago'
journalctl -p err -b
journalctl -k
journalctl --disk-usage
journalctl --vacuum-size=500M
journalctl -o json-pretty -u nginx -n 1

Useful recovery targets¶

systemctl get-default
systemctl isolate rescue.target
systemctl isolate emergency.target

rescue.target tries to give you a usable single-user environment. emergency.target is even more minimal and rude.

Part 4: Processes and Signals¶

Process lifecycle¶

fork() -> child process created
execve() -> process image replaced with new program
wait() / waitpid() -> parent collects exit state
exit() -> process terminates

Every process has:
- PID and PPID
- credentials: UID, GID, groups
- open file descriptors
- memory mappings
- cgroup membership
- namespaces

ps aux
ps -eo pid,ppid,stat,%cpu,%mem,cmd --sort=-%cpu | head
pstree -p

Process states¶

State	Meaning
`R`	runnable or running
`S`	interruptible sleep
`D`	uninterruptible sleep, often I/O wait
`T`	stopped
`Z`	zombie

Zombies use almost no memory but they still consume PID table entries. Enough of them and fork() starts failing.

ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/'
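
Zombies are a symptom of the parent, so it usually helps to group them by PPID. A minimal sketch, assuming procps-style ps; the function name zombies_by_parent is invented here:

```shell
#!/bin/sh
# Count zombie processes per parent PID.
# Reads "PPID STAT" pairs on stdin and prints "COUNT PPID", busiest first.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ }
         END { for (p in n) print n[p], p }' | sort -rn
}

# Usage on a live system (separate -o options keep the headers empty).
ps -e -o ppid= -o stat= | zombies_by_parent | head -5
```

The PPID at the top of that list is the process that is not calling wait(), and that is the one to fix or restart.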

Signals¶

Signal	Purpose
`SIGHUP`	reload by convention
`SIGINT`	interactive interrupt
`SIGQUIT`	quit + core by default
`SIGTERM`	graceful termination
`SIGKILL`	uncatchable kill
`SIGSTOP`	uncatchable stop
`SIGCONT`	continue
`SIGCHLD`	child state changed

kill PID
kill -TERM PID
kill -HUP PID
kill -0 PID
kill -9 PID
pkill -f 'python app'

Operator rule:
1. inspect first
2. SIGTERM second
3. SIGKILL only when grace failed or the thing is obviously wedged
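
The same escalation can be sketched as a small helper. This is an illustration, not a standard tool: graceful_stop and is_running are names invented here, and the 10-second default grace period is an arbitrary assumption.

```shell
#!/bin/sh
# True if the PID exists and has not already exited.
# A zombie (Z) or dead (X) task counts as gone: there is nothing left to signal.
is_running() {
    state=$(awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    case "$state" in
        ""|Z|X) return 1 ;;
    esac
    return 0
}

# graceful_stop PID [TIMEOUT_SECONDS]
graceful_stop() {
    pid=$1
    timeout=${2:-10}
    is_running "$pid" || return 0          # 1. inspect first
    kill -TERM "$pid" 2>/dev/null          # 2. ask politely
    while [ "$timeout" -gt 0 ]; do
        is_running "$pid" || return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    echo "PID $pid ignored SIGTERM, sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null          # 3. last resort
}
```

For services managed by systemd, prefer systemctl stop, which already implements this TERM-then-KILL pattern with a configurable timeout.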

Part 5: Permissions¶

The base permission model¶

-rwxr-xr-- 1 deploy www-data 4096 Mar 23 14:00 app.py

file: r read, w write, x execute
directory: r list names, w create/delete entries, x traverse

chmod 755 file
chmod 644 file
chmod u+x file
chown user:group file

Special bits¶

Bit	Meaning
SUID	execute as file owner
SGID	execute as file group or inherit group on directory
sticky	only the file owner, the directory owner, or root can delete or rename entries in the directory

find / -perm -4000 -ls 2>/dev/null
chmod g+s /srv/shared
chmod +t /tmp

umask¶

umask

Common values:
- 0022 -> files 644, dirs 755
- 0002 -> files 664, dirs 775
- 0077 -> files 600, dirs 700 (private by default)
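
Those values are not magic: the umask bits are cleared from a base mode of 666 for new files and 777 for new directories. A minimal sketch of the arithmetic, with mode_from_umask as a made-up helper name:

```shell
#!/bin/sh
# Resulting mode = base mode with the umask bits cleared.
# $1 = base mode in octal (666 for files, 777 for directories)
# $2 = umask in octal
mode_from_umask() {
    printf '%03o\n' $(( 0$1 & ~0$2 ))
}

mode_from_umask 666 022   # new files under umask 0022
mode_from_umask 777 022   # new directories under umask 0022
mode_from_umask 666 077   # new files under umask 0077
```

Programs can request a narrower mode at creation time, so treat this as the upper bound on what a new file gets, not a guarantee.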

ACLs - when rwx is too blunt¶

Traditional mode bits are coarse. ACLs add per-user and per-group entries.

getfacl file
setfacl -m u:alice:r file
setfacl -m g:ops:rwX /srv/app
setfacl -d -m g:ops:rwX /srv/app

Use ACLs for shared directories and controlled exceptions. Do not turn them into a haunted forest of invisible permissions nobody remembers.

sudo and `visudo`¶

Do not hand-edit /etc/sudoers like a maniac with a flamethrower.

visudo
visudo -c

Prefer small files in /etc/sudoers.d/.

Example:

%wheel ALL=(ALL:ALL) ALL
ops ALL=(root) NOPASSWD: /usr/bin/systemctl restart nginx

Linux capabilities¶

Root used to mean almost all power. Capabilities split that power into smaller pieces.

Examples:
- CAP_NET_BIND_SERVICE - bind to ports below 1024
- CAP_NET_ADMIN - network administration operations
- CAP_SYS_TIME - set system clock
- CAP_SYS_ADMIN - the kitchen-sink monster; avoid when possible

Inspect and set file capabilities:

getcap /path/to/binary
setcap cap_net_bind_service=+ep /usr/local/bin/myweb

Capabilities are great for least privilege. They are also a good way to create weird bugs if you do not understand effective, permitted, inheritable, and ambient sets.
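
When debugging those sets, the raw data is in /proc/PID/status as hex bitmasks (CapInh, CapPrm, CapEff, CapBnd, CapAmb). A rough sketch for testing a single bit follows; has_cap_bit is a name invented here, and the bit number 10 for CAP_NET_BIND_SERVICE comes from linux/capability.h.

```shell
#!/bin/sh
# has_cap_bit HEXMASK BITNUMBER
# Succeeds if the given bit is set in a capability mask such as CapEff.
has_cap_bit() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Usage: does this shell hold CAP_NET_BIND_SERVICE (bit 10) in its effective set?
eff=$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status" 2>/dev/null)
if [ -n "$eff" ] && has_cap_bit "$eff" 10; then
    echo "effective set includes CAP_NET_BIND_SERVICE"
else
    echo "effective set lacks CAP_NET_BIND_SERVICE"
fi
```

Where it is installed, capsh --decode=MASK turns the whole mask into names, which is friendlier than counting bits.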

MAC - SELinux and AppArmor¶

DAC says what the file owner and mode bits allow. MAC says what policy allows, regardless of owner intent.

SELinux is label-based and powerful.
AppArmor is path-based and usually easier to approach.

Quick checks:

# SELinux
getenforce
restorecon -Rv /var/www
ausearch -m avc -ts recent

# AppArmor
aa-status
apparmor_status

If a service gets EACCES but mode bits look fine, think MAC.

Part 6: The Filesystem¶

Everything is a file-ish thing¶

Regular files, directories, symlinks, block devices, character devices, sockets, pipes, procfs, sysfs - Linux represents a lot of system state through file-like interfaces.

Important paths¶

Path	Purpose
`/`	root
`/etc`	configuration
`/var`	variable data
`/home`	user homes
`/root`	root home
`/tmp`	temporary files
`/run`	runtime state, often tmpfs
`/proc`	process and kernel state
`/sys`	device and driver state
`/dev`	device nodes
`/boot`	kernel and bootloader assets
`/opt`	optional third-party software
`/srv`	site/service data

Inodes¶

An inode stores metadata: ownership, mode, timestamps, size, block pointers, and more. Filenames live in directory entries, not inodes.

ls -i file
stat file
df -i

When df -h says there is space but writes still fail, check:
- df -i for inode exhaustion
- read-only remounts
- quotas
- deleted-open-file leaks
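
The inode check is easy to automate. A minimal sketch, assuming POSIX-format df output (df -Pi); the function name inode_pressure and the 90 percent default are arbitrary choices for illustration:

```shell
#!/bin/sh
# List mount points whose inode usage is at or above a threshold (percent).
# Reads `df -Pi` output on stdin so it can also be run against saved output.
inode_pressure() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5; sub(/%/, "", pct)
        # Skip filesystems that report "-" because they have no inode limit.
        if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6, pct "%"
    }'
}

df -Pi | inode_pressure 90
```

No output means no filesystem is near inode exhaustion, so move on to the other items on the list.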

Hard links vs symlinks¶

ln file hardlink
ln -s file symlink

hard link -> same inode, same filesystem only
symlink -> path reference, can cross filesystems, can dangle
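
The difference is easiest to see by deleting the original name. A throwaway demo, assuming GNU or BusyBox stat for the -c %i inode format; link_demo is just a name used here:

```shell
#!/bin/sh
# Create a file, a hard link, and a symlink in a scratch directory,
# then remove the original name and see which link survives.
link_demo() {
    dir=$(mktemp -d) || return 1
    echo "hello" > "$dir/file"
    ln    "$dir/file" "$dir/hardlink"
    ln -s "$dir/file" "$dir/symlink"

    # A hard link is a second directory entry for the same inode.
    [ "$(stat -c %i "$dir/file")" = "$(stat -c %i "$dir/hardlink")" ] \
        && echo "hardlink shares the inode"

    rm "$dir/file"
    echo "hardlink still reads: $(cat "$dir/hardlink")"
    [ -e "$dir/symlink" ] || echo "symlink dangles"

    rm -rf "$dir"
}

link_demo
```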

VFS¶

The Virtual Filesystem layer lets the same syscalls work across ext4, XFS, tmpfs, NFS, overlayfs, procfs, and friends.

Common filesystem types¶

Filesystem	Best use
ext4	sane general-purpose default
XFS	big filesystems, high throughput, default on many RHEL systems
Btrfs	snapshots, checksums, compression, advanced features
tmpfs	RAM-backed temporary data
overlayfs	container layers
NFS	network file sharing

Part 7: Storage¶

Block devices and partitions¶

lsblk
lsblk -f
blkid
fdisk -l
parted -l
findmnt
cat /etc/fstab

LVM - storage virtualization that matters¶

physical disks -> physical volumes -> volume groups -> logical volumes -> filesystems

pvcreate /dev/sdb1
vgcreate data /dev/sdb1
lvcreate -L 50G -n app data
mkfs.ext4 /dev/data/app
mount /dev/data/app /srv/app

Growth example:

lvextend -L +20G /dev/data/app
resize2fs /dev/data/app           # ext4
xfs_growfs /srv/app               # XFS uses mountpoint

Resize caveats worth tattooing on your frontal lobe¶

ext4 can usually grow online; shrinking requires the filesystem to be unmounted.
XFS growth is easy; shrinking is generally not a normal operation to rely on.
Always understand the full stack: partition/LV size, then filesystem size, not just one layer.
Backups first. Heroic confidence after coffee is not a backup strategy.

RAID levels¶

RAID	Use
RAID 0	speed, zero redundancy
RAID 1	mirror
RAID 5	one-disk parity, rebuild risk rises with size
RAID 6	two-disk parity
RAID 10	mirror + stripe, great practical default for important write-heavy workloads

Software RAID basics¶

cat /proc/mdstat
mdadm --detail /dev/md0

When an array is degraded:
- performance often drops
- risk during rebuild rises
- do not celebrate because it is “still up”
- watch SMART data and rebuild progress

Disk health¶

smartctl -a /dev/sda
smartctl -t short /dev/sda
iostat -xz 1 5
iotop

Mount options that matter¶

Option	Use
`noexec`	block direct binary execution
`nosuid`	ignore SUID/SGID
`nodev`	ignore device nodes
`ro`	read-only
`noatime`	reduce access-time writes in some cases

LUKS - disk encryption basics¶

LUKS is the common Linux standard for block-device encryption.

cryptsetup luksFormat /dev/sdb1
cryptsetup open /dev/sdb1 secure_data
mkfs.ext4 /dev/mapper/secure_data

Files involved:
- /etc/crypttab - what to unlock at boot
- initramfs - often required for encrypted root

Back up the LUKS header when appropriate. Lose it and your encrypted data may become modern art.

Part 8: Memory Management¶

Big picture¶

Linux tries to use RAM aggressively. File cache is good. Empty RAM is mostly wasted opportunity.

free -h
cat /proc/meminfo | head -30

The field that usually matters most is MemAvailable, not MemFree.
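
For dashboards or quick checks it can be handy to express that field as a percentage. A small sketch that reads /proc/meminfo directly; the function name mem_available_pct is made up here:

```shell
#!/bin/sh
# Print MemAvailable as a percentage of MemTotal.
# Optional argument: a meminfo-format file (defaults to /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }' \
        "${1:-/proc/meminfo}"
}

mem_available_pct
```

A low MemFree next to a healthy number here usually just means the page cache is doing its job.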

Memory types¶

Type	Meaning
anonymous	heap, stack, private mappings
page cache	cached file data
slab	kernel object caches
shared/tmpfs	shared pages
kernel memory	kernel code and data

Virtual memory¶

Each process sees a virtual address space. The kernel maps that to physical memory. This gives isolation, lazy allocation, copy-on-write, and mmap-backed files.

cat /proc/PID/maps
cat /proc/PID/smaps_rollup
pmap PID

Swap¶

swapon --show
cat /proc/swaps
sysctl vm.swappiness

Swap is not evil. Blindly disabling swap everywhere is meme-ops. But sustained swapping means pressure exists and you should understand why.

OOM killer¶

dmesg -T | grep -i 'oom\|killed process'
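# journalctl -g takes a PCRE2 pattern: use | for alternation, not \|
# an all-lowercase pattern matches case-insensitively
journalctl -k -g 'oom|killed process'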
cat /proc/PID/oom_score
cat /proc/PID/oom_score_adj

Useful idea:
- if the kernel is killing things, the argument is already over
- now you are doing forensics, not philosophy

Memory triage¶

free -h
vmstat 1 5
ps aux --sort=-%mem | head -15
slabtop

Part 9: Networking Fundamentals¶

Interfaces and addresses¶

ip addr show
ip -br addr
ip link show
ip route show
ip neigh show

Prefer ip over old ifconfig and route. The legacy commands still exist in many places, but the iproute2 tools are the modern interface.

DNS, NSS, and why `dig` is not the whole truth¶

There are multiple layers here:
- /etc/hosts
- /etc/nsswitch.conf
- libc resolver behavior
- systemd-resolved on many systems
- /etc/resolv.conf
- upstream DNS servers

So:
- dig example.com queries DNS servers directly, bypassing NSS and /etc/hosts.
- getent hosts example.com asks the system resolver path configured by NSS.
- Those are not the same test.

getent hosts example.com
dig example.com +short
resolvectl status
resolvectl query example.com
cat /etc/nsswitch.conf
ls -l /etc/resolv.conf

If a host resolves with dig but not with getent, the problem may be NSS, search domains, systemd-resolved, or /etc/hosts, not raw DNS reachability.
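
When the two disagree, start by reading the order NSS actually uses. A minimal sketch that prints the hosts line from nsswitch.conf; nss_hosts_order is a helper name invented for this guide:

```shell
#!/bin/sh
# Print the lookup sources, in order, that getent and most libc-based
# programs consult for host names.
# Optional argument: path to an nsswitch.conf-format file.
nss_hosts_order() {
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }' \
        "${1:-/etc/nsswitch.conf}"
}

if [ -r /etc/nsswitch.conf ]; then
    nss_hosts_order
fi
```

If files comes before dns, a stale /etc/hosts entry wins over anything dig shows you. If resolve is listed, systemd-resolved is in the path and resolvectl query is the matching test.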

`/etc/resolv.conf` realities¶

On systems using systemd-resolved, /etc/resolv.conf may be:
- a symlink to the stub resolver config using 127.0.0.53
- a symlink to a generated file listing upstream resolvers
- a static file managed by something else

Do not assume it is a normal hand-edited file anymore.

Connectivity tests¶

ping host
tracepath host
traceroute host
nc -zv host 443
curl -v telnet://host:443
ss -tlnp
tcpdump -i eth0 port 443

TCP states worth knowing¶

State	Meaning	Common interpretation
`LISTEN`	waiting for inbound connections	normal for servers
`ESTAB`	connection active	normal
`TIME-WAIT`	recently closed	many short-lived connections
`CLOSE-WAIT`	peer closed, local side has not	application bug or leak
`SYN-SENT`	outbound connect in progress	upstream unreachable or filtered

Many CLOSE-WAIT sockets usually mean your application is failing to close descriptors after the peer has gone away.
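
A quick way to spot that pattern is to count sockets per state. A small sketch that summarizes ss -tan output; tcp_state_counts is a name chosen here for illustration:

```shell
#!/bin/sh
# Count TCP sockets per state, most common first.
# Reads `ss -tan` output on stdin (first column is the state).
tcp_state_counts() {
    awk 'NR > 1 { n[$1]++ }
         END { for (s in n) print n[s], s }' | sort -rn
}

ss -tan | tcp_state_counts
```

If CLOSE-WAIT keeps climbing between runs, ss -tanp state close-wait (as root) shows which process owns the sockets.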

Bridges, bonds, VLANs - the one-screen version¶

bridge - software L2 switch joining interfaces into one broadcast domain
bond/team - combine multiple NICs for redundancy or aggregated bandwidth
VLAN - isolate traffic at layer 2 using tagged networks

Quick examples:

bridge link
bridge vlan show
ip -d link show
cat /proc/net/bonding/bond0

If you work around virtualization, hypervisors, KVM, Proxmox, libvirt, or container hosts, bridges and VLANs stop being “advanced” and become Tuesday.

Policy routing and multiple tables¶

Sometimes the right route depends on source IP, mark, or interface. That is policy routing, not basic destination lookup.

ip rule show
ip route show table main
ip route show table all

If VPN, multihoming, or weird asymmetric paths are involved, look here.

Part 10: Firewalls¶

Linux firewalling today is nftables-first conceptually, even when older tools are still in circulation.

nftables mental model¶

tables hold chains
chains hold rules
rules match packets and take actions
one ruleset can cover IPv4 and IPv6 cleanly

Inspect the current ruleset:

nft list ruleset

Example config:

table inet filter {
  chain input {
    type filter hook input priority 0;
    policy drop;

    ct state established,related accept
    iif lo accept
    tcp dport { 22, 80, 443 } accept
    ip protocol icmp accept
    # match on the final L4 protocol; ip6 nexthdr misses packets with extension headers
    meta l4proto ipv6-icmp accept
  }
}

Check the file first, then apply it:

nft -c -f /etc/nftables.conf    # parse and validate only, change nothing
nft -f /etc/nftables.conf

iptables still matters¶

You will still see iptables because:
- old docs never die
- Docker, kube-proxy, fail2ban, and assorted tools still expose iptables-shaped behavior
- many distributions ship iptables compatibility frontends backed by nftables underneath

Useful commands:

iptables -L -n -v --line-numbers
iptables -t nat -L -n -v
iptables-save

Prefer conntrack syntax over the older state match when writing new iptables rules:

iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

Firewalld and UFW¶

firewalld is common on RHEL-family systems
UFW is common on Ubuntu
both are frontends, not the kernel firewall engine itself

firewall-cmd --state
firewall-cmd --get-active-zones
ufw status verbose

Rule ordering still matters¶

Whether nftables or iptables, careless DROP rules can lock you out. Keep your current connection safe before you get clever.

Part 11: SSH¶

What happens on connect¶

TCP connect to port 22
key exchange
host key verification
user authentication
channel/session setup

Host trust hygiene¶

SSH uses TOFU - trust on first use - unless you pre-seed trust another way.

Important files:
- ~/.ssh/known_hosts
- ~/.ssh/config
- ~/.ssh/id_ed25519

Key types¶

Type	Advice
Ed25519	preferred general default
RSA	legacy compatibility
ECDSA	acceptable
DSA	dead, leave it buried

ssh-keygen -t ed25519 -C 'deploy@example'
ssh-copy-id user@host

Config file¶

Host prod-*
    User deploy
    IdentityFile ~/.ssh/deploy_ed25519
    ProxyJump bastion.example.com

Host db-primary
    HostName 10.0.2.50
    User postgres
    Port 2222

Tunnels¶

ssh -L 8080:localhost:80 user@remote
ssh -R 8080:localhost:3000 user@remote
ssh -D 1080 user@remote
ssh -J bastion internal-host

Agent forwarding¶

ssh -A bastion

Use sparingly. Root on the intermediate host can potentially abuse your forwarded agent. ProxyJump is often the cleaner answer.

Part 12: The `/proc` Filesystem¶

/proc is a virtual filesystem exposing kernel and process state.

Per-process inspection¶

cat /proc/$$/cmdline | tr '\0' ' '
ls -la /proc/$$/cwd
cat /proc/$$/environ | tr '\0' '\n' | head
cat /proc/$$/status
ls -la /proc/$$/fd
cat /proc/$$/maps

Secrets warning: environment variables are not a magical safe. Same-user or root access can often inspect them.

System-wide files¶

cat /proc/meminfo
cat /proc/cpuinfo
cat /proc/loadavg
cat /proc/uptime
cat /proc/sys/kernel/pid_max
cat /proc/net/tcp

Deleted open files¶

lsof +L1
find /proc/*/fd -ls 2>/dev/null | grep deleted

If a file is deleted but still open, disk space is not reclaimed until the process closes it or dies.
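
You can reproduce the effect safely in a scratch file. A minimal demo, assuming a Linux /proc; deleted_open_demo and the use of descriptor 9 are arbitrary choices:

```shell
#!/bin/sh
# Show that an unlinked file stays allocated while a descriptor is open.
deleted_open_demo() {
    tmp=$(mktemp) || return 1
    exec 9>"$tmp"            # hold the file open on descriptor 9
    echo "still here" >&9
    rm "$tmp"                # the name is gone, the inode is not
    # readlink inherits fd 9, so /proc/self/fd/9 resolves inside readlink.
    readlink /proc/self/fd/9
    exec 9>&-                # closing the descriptor finally frees the space
}

deleted_open_demo
```

The printed path ends in "(deleted)", which is the same marker the find and lsof commands above are looking for.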

Part 13: Debugging with `strace`¶

strace shows syscalls. That is often enough to expose what a process is actually waiting on.

strace -p 12345
strace -p 12345 -t -T
strace -f ./deploy.sh
strace -e trace=file ./myapp
strace -e trace=network ./myapp

Patterns¶

Stuck process

strace -p PID
# blocked on read, connect, poll, futex, openat, etc.

Slow startup

strace -T -e trace=file,network ./myapp 2>&1 | sort -t'<' -k2 -rn | head

Permission denied

strace -e trace=file ./myapp 2>&1 | grep EACCES

strace is not subtle, but subtle is overrated at 3 AM.

Part 14: Performance Triage¶

USE method¶

For each resource, check:
- Utilization
- Saturation
- Errors

Resource	Utilization	Saturation	Errors
CPU	`top`, `mpstat`	run queue	kernel or hardware complaints
Memory	`free -h`, `vmstat`	swap, reclaim, OOM	OOM logs
Disk	`iostat -xz`	`await`, queue depth	I/O errors
Network	`sar -n DEV`, `ip -s link`	drops, backlog, retransmits	driver/link errors

Quick triage chain¶

uptime
free -h
df -h
df -i
dmesg -T | tail -30
iostat -xz 1 3
ss -s
ps -eo pid,ppid,%cpu,%mem,stat,cmd --sort=-%cpu | head
ps -eo pid,ppid,%cpu,%mem,stat,cmd --sort=-%mem | head

Load average¶

Load is runnable tasks plus tasks stuck in uninterruptible sleep, usually I/O.

uptime
nproc

High load with low CPU usage often means I/O pain, not CPU pain.
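
Raw load numbers only mean something relative to CPU count. A small sketch that normalizes the 1-minute load; load_per_cpu is a made-up helper, and the arguments exist only so it can be fed fixed numbers:

```shell
#!/bin/sh
# Print the 1-minute load average divided by the number of CPUs.
# load_per_cpu [LOAD] [CPUS]; both default to the live system values.
load_per_cpu() {
    load=${1:-$(cut -d' ' -f1 /proc/loadavg)}
    cpus=${2:-$(nproc)}
    awk -v l="$load" -v c="$cpus" 'BEGIN { printf "%.2f\n", l / c }'
}

load_per_cpu
```

As a rough rule of thumb, a sustained value well above 1.00 means work is queueing. Whether it queues on CPU or on I/O is what iostat and the D-state column in ps tell you.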

Part 15: Logging¶

Common places¶

/var/log/syslog or /var/log/messages
/var/log/auth.log or /var/log/secure
/var/log/kern.log
application logs under /var/log/<app>/

journald essentials¶

journalctl -u nginx -f
journalctl -u nginx --since '1 hour ago'
journalctl -b -p err
journalctl -k
journalctl --disk-usage
journalctl --vacuum-size=500M

logrotate¶

Example:

/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        systemctl reload myapp
    endscript
}

Important modern nuance:
- on some systems log rotation is driven by cron
- on others it is driven by a systemd timer such as logrotate.timer
- do not assume cron is the scheduler without checking

systemctl status logrotate.timer
systemctl list-timers | grep logrotate

Part 16: Package Management¶

Debian / Ubuntu¶

apt update
apt upgrade
apt install nginx
apt remove nginx
apt purge nginx
apt search nginx
apt-cache policy nginx
dpkg -l | grep nginx
dpkg -L nginx

RHEL / Fedora / Rocky / Alma¶

dnf install nginx
dnf upgrade
dnf info nginx
dnf remove nginx
dnf list installed | grep nginx
rpm -qa | grep nginx
rpm -ql nginx

Package hygiene¶

prefer vendor packages or known repositories over random curl-pipe installers
understand what created a file before you edit or delete it
config drift and package ownership matter

Part 17: Text Processing¶

Pipeline mindset¶

grep ' 500 ' access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head

Core tools¶

grep -r 'TODO' src/
grep -E 'error|warn' file
awk '{print $1}' file
awk -F: '{print $1,$7}' /etc/passwd
sed -n '10,20p' file
sed 's/old/new/g' file
sort -rn
uniq -c
cut -d: -f1 /etc/passwd
tr 'a-z' 'A-Z'
head -20 file
tail -f file
tee file
xargs

One nitpick worth keeping: avoid useless cat file | grep pattern when grep pattern file does the job. It is not a moral issue, just cleaner.

Part 18: cgroups and Namespaces¶

cgroup v2¶

Modern Linux increasingly means cgroup v2: a single unified hierarchy.

Useful checks:

mount | grep cgroup
stat -fc %T /sys/fs/cgroup
cat /proc/self/cgroup
systemd-cgls
systemd-cgtop

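Common files on cgroup v2 systems. cgroup.controllers exists at the root; memory.current, memory.max, and cpu.max exist only in non-root cgroups where that controller is enabled, so read them from a slice or service directory:

cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/system.slice/memory.current
cat /sys/fs/cgroup/system.slice/memory.max
cat /sys/fs/cgroup/system.slice/cpu.max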

Per-service example:

systemctl show nginx -p ControlGroup
CG=$(systemctl show nginx -p ControlGroup --value)
cat /sys/fs/cgroup${CG}/memory.current

Namespaces¶

Namespace	Isolates
PID	process IDs
net	network stack
mount	mount table
UTS	hostname
user	UID/GID mappings
IPC	shared IPC objects
cgroup	cgroup view
time	time namespaces on supported systems

ls -la /proc/PID/ns
ip netns add test
ip netns exec test ip addr

Containers are mostly cgroups + namespaces + filesystem layering + runtime tooling.

Part 19: Security Hardening¶

SSH daemon baseline¶

PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy admin
MaxAuthTries 3

Least privilege¶

run services as dedicated users
use capabilities when a narrow privilege is enough
use sudoers.d instead of handing out full root casually
restrict writable paths in systemd units

MAC controls¶

SELinux: powerful label-based enforcement
AppArmor: simpler path-based confinement on many Ubuntu systems

Firewall baseline¶

default deny inbound unless host role says otherwise
allow only required services
document exceptions
beware container/orchestrator interaction with host firewall rules

Kernel and sysctl hardening¶

Example baseline ideas:

net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.tcp_syncookies = 1
kernel.dmesg_restrict = 1
fs.protected_hardlinks = 1
fs.protected_symlinks = 1

Patching and provenance¶

keep the OS current
know which repos you trust
verify what owns a binary and where it came from
avoid mystery curl scripts unless you have reviewed them

Auditing¶

auditctl -w /etc/passwd -p wa -k passwd_changes
ausearch -k passwd_changes
last
lastb

Part 20: eBPF¶

eBPF lets you run verified sandboxed programs in the kernel for observability, networking, and security uses.

Examples:

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
execsnoop
opensnoop
biolatency
tcpconnect

It is absurdly powerful. It is also not beginner-friendly when you leave the one-liner lane.

Part 21: Linux Distributions¶

Family	Examples	Package tools	Common defaults
Debian	Debian, Ubuntu	apt, dpkg	ext4, AppArmor often on Ubuntu
Red Hat	RHEL, Rocky, Alma, Fedora	dnf, rpm	XFS common, SELinux strong
SUSE	SLES, openSUSE	zypper, rpm	Btrfs common on root
Arch	Arch, Endeavour	pacman	rolling release
Alpine	Alpine	apk	musl, small footprint

Core Linux skills transfer. Packaging, defaults, release model, and support policies are where the families diverge.

Part 22: On-Call Survival Guide¶

Disk full¶

df -h
df -i
du -xhd1 /var | sort -h
lsof +L1
journalctl --disk-usage

OOM¶

dmesg -T | grep -i 'oom\|killed process'
free -h
ps aux --sort=-%mem | head -15

Service failed¶

systemctl status SERVICE
journalctl -u SERVICE -n 100 --no-pager
ss -tlnp | grep PORT
systemctl cat SERVICE

High load¶

uptime
nproc
iostat -xz 1 3
vmstat 1 5
top -bn1 | head -30

Safe vs dangerous¶

Usually safe	Usually dangerous
read logs and status	`kill -9` on business-critical daemons
inspect sockets, pids, mounts	deleting unknown files under pressure
collect evidence	rebooting before you know what happened
journal vacuum with intent	`docker system prune` in anger

Part 23: Real-World Case Studies¶

Case 1: OOM kills the app¶

Symptom: app dies, app logs say little.

Investigation: dmesg shows the kernel killed it. The heap or memory budget was sized as if the host belonged entirely to that one process.

Fix: reduce heap, add memory limits, add monitoring, leave headroom for kernel and cache.

Case 2: Disk “full” but `df` looks okay¶

Symptom: app still cannot write.

Investigation: df -i shows inode exhaustion or lsof +L1 shows giant deleted-open logs.

Fix: clean up the tiny-file storm, or restart the offending process (and fix its log rotation) so it releases the deleted files.

Case 3: Zombie army¶

Symptom: fork() fails with EAGAIN.

Investigation: parent process is not reaping children. Zombies pile up.

Fix: fix the parent, restart it, or kill it so PID 1 adopts and reaps the zombies.

Case 4: Service flapping under systemd¶

Symptom: service restarts every few seconds and hits start limit.

Investigation: journalctl -u reveals bad config path, bad permissions, or missing dependency.

Fix: use absolute paths, correct WorkingDirectory, fix config, then systemctl reset-failed SERVICE.

Case 5: Logs eat root¶

Symptom: SSH slow, commands fail, temp files cannot be created.

Investigation: giant logs, failed rotation, or runaway debug mode.

Fix: truncate carefully if the file is open, repair rotation scheduling, consider separate /var.

Case 6: High load, low CPU¶

Symptom: load average huge, CPUs not pegged.

Investigation: iostat shows long await; tasks are stuck in I/O wait.

Fix: treat it as a storage bottleneck, not a CPU bottleneck. Different war, different tools.

Glossary¶

Term	Meaning
kernel	core of the operating system
syscall	userspace entry into kernel services
PID 1	init process, usually systemd
inode	file metadata record
file descriptor	numeric handle for open file/socket/pipe
page cache	RAM used for file caching
OOM killer	kernel logic that kills tasks under extreme memory exhaustion
cgroup	resource control grouping
namespace	isolation boundary
unit	systemd-managed object
target	systemd grouping / boot milestone
initramfs	temporary early root in RAM
ESP	EFI System Partition
LVM	Logical Volume Manager
LUKS	Linux block-device encryption standard
ACL	access control list
capability	fine-grained kernel privilege
MAC	mandatory access control
TOFU	trust on first use
NSS	Name Service Switch

Flashcards¶

Boot and kernel¶

Q	A
Boot chain in order?	firmware -> bootloader -> kernel -> initramfs -> PID 1
What is initramfs for?	early userspace needed to reach the real root filesystem
What happens if PID 1 exits?	kernel panic
`network.target` vs `network-online.target`?	startup marker vs actual wait-for-network target

Processes and permissions¶

Q	A
`SIGTERM` vs `SIGKILL`?	graceful request vs uncatchable kill
What is a zombie?	exited process not yet reaped
Why use ACLs?	mode bits are too coarse for some sharing needs
Why use capabilities?	narrow privileges instead of full root

Storage and memory¶

Q	A
ext4 shrink?	possible, but only offline (unmounted)
XFS shrink?	not something to plan your life around
`MemFree` or `MemAvailable`?	`MemAvailable`
What does `lsof +L1` find?	deleted files still open

Networking and security¶

Q	A
`dig` vs `getent hosts`?	raw DNS query vs system resolver/NSS path
Many `CLOSE-WAIT` sockets mean?	app is not closing connections
nftables or iptables first?	nftables first for modern mental model
What allows binding privileged ports without full root?	`CAP_NET_BIND_SERVICE`

Drills¶

Drill 1: Read the local boot and init path¶

cat /proc/cmdline
ps -p 1 -o pid,comm,args
systemd-analyze
systemd-analyze critical-chain

Drill 2: Inspect a running process deeply¶

PID=$(pgrep -n sshd)
cat /proc/$PID/status
ls -la /proc/$PID/fd | head
cat /proc/$PID/cgroup

Drill 3: Compare DNS tools¶

dig example.com +short
getent hosts example.com
resolvectl query example.com

Explain why the answers may differ.
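
One concrete place to look for the explanation is the `hosts:` line in `/etc/nsswitch.conf`: `getent` walks those sources in order, while `dig` talks to a DNS server directly and ignores them. The sketch below prints that lookup order; `nss_hosts_order` is a hypothetical name, and the optional argument exists only so the function can be pointed at a different file.

```shell
# Print the NSS source order used for host lookups.
# $1 = path to an nsswitch.conf-style file (default: /etc/nsswitch.conf).
nss_hosts_order() {
  awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print }' "${1:-/etc/nsswitch.conf}"
}

# Typical use:
#   nss_hosts_order
# If "files" comes before "dns", an /etc/hosts entry wins for getent
# even though dig still shows the public DNS answer.
```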

Drill 4: Find deleted open files¶

lsof +L1

Drill 5: Add a systemd override safely¶

systemctl edit nginx
systemctl daemon-reload
systemctl restart nginx
systemctl cat nginx
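
`systemctl edit` opens an editor, which is awkward in automation. The same kind of override can be written as a drop-in file directly. This is a sketch under stated assumptions: the drop-in lives at `<config root>/<unit>.d/override.conf`, the config root defaults to `/etc/systemd/system`, and the `Restart=` settings are only illustrative. `write_dropin` is a hypothetical helper.

```shell
# Write a drop-in override for a unit and print the path of the file created.
# $1 = full unit name (e.g. nginx.service)
# $2 = systemd config root (assumed default: /etc/systemd/system)
write_dropin() {
  dir="${2:-/etc/systemd/system}/$1.d"
  mkdir -p "$dir"
  # Illustrative settings only; replace with the override you actually need.
  printf '[Service]\nRestart=on-failure\nRestartSec=5s\n' > "$dir/override.conf"
  printf '%s\n' "$dir/override.conf"
}

# Typical use on a real host (as root), followed by a reload:
#   write_dropin nginx.service
#   systemctl daemon-reload && systemctl restart nginx
```

Unlike `systemctl edit`, a hand-written drop-in is not picked up until `systemctl daemon-reload` runs.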

Drill 6: Inspect cgroup v2 data for a service¶

CG=$(systemctl show ssh -p ControlGroup --value)   # unit is "sshd" on RHEL-family and Arch
echo "$CG"
cat /sys/fs/cgroup${CG}/memory.current
cat /sys/fs/cgroup${CG}/cpu.stat
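
The raw counters in `cpu.stat` are easier to read as a ratio. The sketch below computes how often the cgroup was throttled, assuming the `nr_periods` and `nr_throttled` keys are present (they appear when the cpu controller is enabled for that cgroup); `throttle_ratio` is a hypothetical name.

```shell
# Print the share of enforcement periods in which the cgroup was throttled.
# stdin = contents of a cgroup v2 cpu.stat file.
throttle_ratio() {
  awk '
    $1 == "nr_periods"   { p = $2 }
    $1 == "nr_throttled" { t = $2 }
    END {
      if (p > 0) printf "%.1f%%\n", 100 * t / p
      else print "n/a"    # no periods recorded or keys missing
    }'
}

# Typical use, with CG set as in the drill:
#   throttle_ratio < "/sys/fs/cgroup${CG}/cpu.stat"
```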

Drill 7: Check ACLs and capabilities¶

getfacl /srv/shared
getcap -r /usr/local/bin /usr/bin 2>/dev/null | head

Drill 8: One-minute triage drill¶

Collect these with no commentary first:

uptime
free -h
df -h
df -i
dmesg -T | tail -20
ss -s

Then write a three-sentence diagnosis hypothesis.
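
Capturing the output to a file keeps the evidence around after the incident. The sketch below wraps the same six commands; the function names and the output filename pattern are hypothetical, and a command that is missing or fails is noted in the file rather than aborting the run.

```shell
# Run one command under a labelled header; record failures instead of stopping.
section() {
  printf '### %s\n' "$*"
  "$@" 2>&1 || printf '(failed or unavailable: %s)\n' "$1"
  echo
}

# Collect the one-minute triage set into a timestamped file; print its path.
# $1 = output directory (default: /tmp)
triage_snapshot() {
  out="${1:-/tmp}/triage-$(date +%Y%m%dT%H%M%S).txt"
  {
    section uptime
    section free -h
    section df -h
    section df -i
    section sh -c 'dmesg -T | tail -n 20'
    section ss -s
  } > "$out"
  printf '%s\n' "$out"
}

# Typical use:
#   less "$(triage_snapshot /var/tmp)"
```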

Cheat Sheet¶

Process and service control¶

ps aux
pstree -p
kill -TERM PID
kill -9 PID
systemctl status SERVICE
journalctl -u SERVICE -n 50 --no-pager

Disk and memory¶

df -h
df -i
du -xhd1 /var | sort -h
free -h
vmstat 1 5
lsof +L1

Network and DNS¶

ip -br addr
ip route
ss -tlnp
getent hosts name
dig name +short
resolvectl status

Firewall¶

nft list ruleset
iptables -L -n -v --line-numbers
firewall-cmd --list-all
ufw status verbose

Storage¶

lsblk -f
blkid
findmnt
cat /proc/mdstat
pvs && vgs && lvs

Quick triage chain¶

systemctl status -> journalctl -> ss -> df -h / df -i -> free -h -> dmesg -> iostat/vmstat

Self-Assessment¶

Notes on Scope¶

This guide intentionally corrected and modernized several common Linux-teaching mistakes:

- it treats nftables as the modern firewall model, while still covering legacy iptables
- it treats cgroup v2 as the modern baseline
- it distinguishes network.target from network-online.target
- it separates DNS testing from system name-resolution testing
- it includes ACLs, capabilities, sudo hygiene, AppArmor/SELinux, LUKS, and storage resize caveats that broad “Linux complete guides” often skip

That makes it less flashy than a “one doc explains literally everything forever” claim, but far more trustworthy.

Verification Notes¶

Modernized sections in this revision were checked against current upstream or vendor documentation for these areas:

- systemd service ordering and network-online.target behavior
- cgroup v2 unified hierarchy
- nftables as the modern Netfilter framework and iptables compatibility layers
- distro-specific GRUB regeneration workflows
- Secure Boot chain details, including initrd nuance and module validation
- systemd-resolved, resolvectl, and /etc/resolv.conf modes
- AppArmor, SELinux, ACL, capability, and visudo behavior
- logrotate scheduling via systemd timers on modern systems

That does not make every sentence timeless. Linux changes. But it removes the obvious stale landmines from the prior draft.
