
Linux Boot Process — Footguns & Pitfalls

These are the mistakes that turn a simple reboot into a multi-hour emergency. Every one of these has bricked production servers. Learn them so you don't repeat them.


Editing /etc/fstab Without Testing

The footgun: Changing /etc/fstab and rebooting without verifying the changes.

# Someone adds a new mount to fstab:
UUID=wrong-uuid-here    /mnt/data    ext4    defaults    0    2

# They reboot. The UUID doesn't exist (typo). The system drops to emergency mode
# because systemd can't mount the filesystem and the default is to fail.

Why it's devastating:

- On remote servers, you can't reach the GRUB menu or emergency shell without IPMI/console access
- Cloud instances may require detaching the root volume and attaching it to another instance to fix
- The entire team is blocked while someone scrambles for console access

Prevention — always test before rebooting:

# After editing fstab, verify ALL entries:
$ sudo mount -a
# If this succeeds with no errors, it's safe to reboot

# Also validate the syntax:
$ sudo findmnt --verify --tab-file /etc/fstab

# For entries that depend on network (NFS, iSCSI), use nofail:
UUID=abc123    /mnt/nfs    nfs    defaults,nofail,_netdev    0    0
# nofail means: if mount fails, continue booting anyway
# _netdev means: wait for network before trying to mount

The nofail option is your safety net. For any non-critical mount, add nofail so the system boots even if the mount fails.
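Beyond nofail, systemd honors per-mount options directly in fstab that bound how long a missing device can stall the boot. A sketch (x-systemd.device-timeout and x-systemd.automount are standard systemd.mount options; the UUID and mount point are placeholders):

```shell
# Cap how long systemd waits for the device before giving up (default is 90s):
UUID=abc123    /mnt/data    ext4    defaults,nofail,x-systemd.device-timeout=10s    0    2

# Or skip the mount at boot entirely and mount lazily on first access:
UUID=abc123    /mnt/data    ext4    defaults,nofail,x-systemd.automount    0    2
```

Either way, a dead disk costs you seconds at boot instead of a drop into emergency mode.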


Deleting Old Kernels Without Keeping a Fallback

The footgun: Aggressively removing old kernels to save space in /boot, leaving only one kernel. Then that kernel has a problem.

# "I'll clean up /boot"
$ sudo apt-get purge linux-image-5.15.0-{85,86,87,88,89,90}-generic
$ sudo apt-get purge linux-image-5.15.0-91-generic   # This was the only working one

# Now the only installed kernel is 5.15.0-92-generic, which has a driver regression
# that prevents your RAID controller from being detected.
# System won't boot. No fallback kernel in GRUB.

Prevention:

# Always keep at least TWO kernels: current + one known-good fallback
$ dpkg --list 'linux-image-*' | grep '^ii'
# Make sure there are at least 2 entries

# On RHEL, set the retention limit:
# /etc/dnf/dnf.conf
installonly_limit=3    # Keep 3 kernels

# On Debian, autoremove handles this:
$ sudo apt-get autoremove    # Safely removes old kernels, keeps current + one

# After a kernel update, TEST the new kernel before removing the old one:
# 1. Reboot into new kernel
# 2. Verify everything works (storage, networking, GPU, etc.)
# 3. Only THEN remove old kernels
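The "at least two kernels" rule is easy to script. A minimal sketch that counts installed linux-image packages from `dpkg --list` output (the package-name pattern is an assumption for Debian/Ubuntu signed kernel images; adjust for your distro):

```shell
# count_installed_kernels: count "ii" (installed) versioned linux-image
# packages from `dpkg --list` output on stdin. The [0-9] excludes
# meta-packages like linux-image-generic.
count_installed_kernels() {
    grep -c '^ii[[:space:]]\+linux-image-[0-9]'
}

# Usage on a real system:
#   n=$(dpkg --list 'linux-image-*' | count_installed_kernels)
#   [ "$n" -ge 2 ] || echo "WARNING: no fallback kernel installed!" >&2
```

Wire the warning into your monitoring so a missing fallback is caught before the next reboot, not after.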


Filling the /boot Partition

The footgun: Not monitoring the /boot partition size. Kernel updates accumulate until /boot is full, then apt upgrade or yum update fails, sometimes leaving a half-installed kernel.

$ df -h /boot
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       477M  475M     0 100% /boot

$ sudo apt-get upgrade
E: Could not write to /boot/initrd.img-5.15.0-93-generic - No space left on device
dpkg: error processing package linux-image-5.15.0-93-generic (--configure):
 installed linux-image-5.15.0-93-generic package post-installation script subprocess returned error exit status 1

Why it happens:

- /boot is often a small separate partition (256-512 MB)
- Each kernel takes roughly 50-100 MB (vmlinuz + initrd + System.map + config)
- After 5-6 kernel updates without cleanup, it fills up
- Automated updates may then start failing silently

Prevention:

# Monitor /boot usage
$ df -h /boot

# Set up automatic kernel cleanup
# Debian/Ubuntu: ensure unattended-upgrades removes old kernels
$ cat /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-New-Unused-Dependencies "true";

# Or create a simple cron check:
$ cat > /etc/cron.daily/check-boot-space << 'EOF'
#!/bin/bash
USAGE=$(df /boot --output=pcent | tail -1 | tr -d '% ')
if [ "$USAGE" -gt 80 ]; then
    echo "/boot is ${USAGE}% full on $(hostname)" | mail -s "ALERT: /boot space" ops@example.com
fi
EOF
$ chmod +x /etc/cron.daily/check-boot-space
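On hosts that standardize on systemd timers rather than cron, the same daily check can run as a timer pair. A sketch (the unit names and script path are placeholders; the script body would be the cron check above):

```ini
# /etc/systemd/system/check-boot-space.service
[Unit]
Description=Alert when /boot is nearly full

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/check-boot-space

# /etc/systemd/system/check-boot-space.timer
[Unit]
Description=Daily /boot space check

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl daemon-reload && systemctl enable --now check-boot-space.timer`. Persistent=true runs a missed check after downtime, which cron.daily does not guarantee.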

Recovery when /boot is full: See the Street Ops section on filling /boot partition.


Bad GRUB Config Without Regenerating

The footgun: Manually editing /boot/grub/grub.cfg instead of editing /etc/default/grub and running update-grub.

# Someone hand-edits grub.cfg:
$ sudo vim /boot/grub/grub.cfg
# They make a change that seems fine...

# Next kernel update runs update-grub, which OVERWRITES grub.cfg
# from the template files in /etc/grub.d/ and settings in /etc/default/grub
# The manual edit is gone. If it was important, things break.

# Or worse: the manual edit introduces a syntax error
# and now GRUB can't parse its config at all

The correct approach:

# 1. Edit settings:
$ sudo vim /etc/default/grub

# 2. For custom entries, create a script in /etc/grub.d/:
$ sudo vim /etc/grub.d/40_custom

# 3. Regenerate:
$ sudo update-grub              # Debian/Ubuntu
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL

# 4. Verify the generated config looks right:
$ grep menuentry /boot/grub/grub.cfg
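A custom entry goes below the `exec tail` line that ships with /etc/grub.d/40_custom, so update-grub copies it verbatim into grub.cfg. A sketch (the kernel version and filesystem UUID are placeholders):

```shell
#!/bin/sh
exec tail -n +3 $0
# Lines below this one are copied verbatim into grub.cfg by update-grub.
menuentry "Fallback kernel 5.15.0-91" {
    search --no-floppy --fs-uuid --set=root abc123
    linux /boot/vmlinuz-5.15.0-91-generic root=UUID=abc123 ro
    initrd /boot/initrd.img-5.15.0-91-generic
}
```

A pinned known-good entry like this survives regeneration, which a hand-edit to grub.cfg never does.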


initramfs Missing Critical Drivers

The footgun: Updating the kernel or storage configuration without regenerating initramfs, or regenerating it without the necessary drivers.

# Scenario 1: You move root to a new storage controller (e.g., from SATA to NVMe)
# but forget to rebuild initramfs with the NVMe driver.
# Kernel boots, can't find root filesystem -> panic

# Scenario 2: You update dracut config to exclude "unnecessary" modules
# /etc/dracut.conf.d/slim.conf
omit_drivers+=" megaraid_sas mpt3sas "
# Rebuild initramfs, reboot... kernel can't see your RAID array

# Scenario 3: You switch root filesystem from ext4 to XFS
# but initramfs still only has ext4 module

Prevention:

# After any storage change, rebuild initramfs:
$ sudo update-initramfs -u      # Debian
$ sudo dracut -f                # RHEL

# Verify the initramfs has the drivers you need:
$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -i "nvme\|megaraid\|xfs"

# Before rebooting into a new kernel, verify initramfs exists and is non-empty:
$ ls -la /boot/initrd.img-5.15.0-93-generic
-rw-r--r-- 1 root root 67108864 Mar 19 10:00 /boot/initrd.img-5.15.0-93-generic
# If the file is tiny (< 1 MB) or missing, DO NOT REBOOT

# Dracut: check what modules are included:
$ lsinitrd /boot/initramfs-$(uname -r).img | head -30
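The driver check can be turned into a refusal gate before reboot. A sketch that scans an initramfs file listing (as produced by lsinitramfs or lsinitrd) for a required kernel module, allowing for compressed .ko files:

```shell
# has_module: succeed if the named module appears as a .ko (possibly
# .gz/.xz/.zst compressed) in an initramfs listing read from stdin.
has_module() {
    grep -q "/${1}\.ko\(\.\(gz\|xz\|zst\)\)\?$"
}

# Usage on a real system:
#   lsinitramfs /boot/initrd.img-$(uname -r) | has_module nvme \
#       || echo "DO NOT REBOOT: nvme driver missing from initramfs" >&2
```

Run it for every driver your root path needs (storage controller, filesystem, dm/md modules) after any rebuild.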


Changing the Default systemd Target Without Understanding Dependencies

The footgun: Setting the default target to something that doesn't include networking or SSH.

# "I don't need a GUI on this server"
$ sudo systemctl set-default multi-user.target
# This is usually fine — multi-user.target includes networking and SSH

# But what if someone does:
$ sudo systemctl set-default rescue.target
# Rescue mode requires physical console access
# Remote servers are now unreachable after reboot

# Or:
$ sudo systemctl isolate emergency.target
# Emergency mode: minimal services, no networking, no SSH
# On a remote server, this is instant loss of access

Prevention:

# For servers, the correct target is almost always multi-user.target
$ sudo systemctl set-default multi-user.target

# Before isolating to a different target, understand what it includes:
$ systemctl list-dependencies rescue.target
# Note: no network.target, no sshd.service

# If you need to change targets for maintenance on a remote server,
# schedule a timed revert first:
$ sudo systemctl set-default rescue.target
$ echo "systemctl set-default multi-user.target" | at now + 10 minutes
# Caveat: at jobs run only while atd is up, and rescue.target does not
# start atd — so this guards against forgetting to revert, not against
# losing access in rescue mode itself. Have console access lined up.
$ sudo reboot


Disabling Services You Don't Understand

The footgun: Disabling systemd services to "speed up boot" or "harden the system" without understanding their dependencies.

# "I don't need this, it slows down boot"
$ sudo systemctl disable systemd-udevd.service
# Hardware detection won't work. Enjoy no storage/network.

$ sudo systemctl disable systemd-journald.service
# All logging stops. You'll have zero diagnostics when things break.

$ sudo systemctl disable dbus.service
# D-Bus is the system message bus. NetworkManager, systemd, polkit — all broken.

Safe boot optimization:

# First, identify what's actually slow:
$ systemd-analyze blame | head -15

# Only disable services you understand and know aren't needed:
$ sudo systemctl disable snapd.service          # If you don't use snaps
$ sudo systemctl disable ModemManager.service    # Servers don't have modems
$ sudo systemctl disable bluetooth.service       # Servers don't need bluetooth
$ sudo systemctl disable cups.service            # Print service on a server? No.

# NEVER disable without checking dependencies:
$ systemctl list-dependencies --reverse systemd-udevd.service
# Shows what depends on this service


UEFI Boot Entry Corruption

The footgun: Manually editing UEFI NVRAM variables or the ESP without understanding the boot entry structure.

# Removing all boot entries:
$ sudo efibootmgr -b 0001 -B
$ sudo efibootmgr -b 0002 -B
# Now there are no boot entries. Firmware has nothing to boot.

# Or deleting files from ESP:
$ sudo rm -rf /boot/efi/EFI/ubuntu/
# UEFI firmware can't find the bootloader.

Recovery:

# Boot from live USB, then mount root FIRST, then the ESP inside it:
$ sudo mount /dev/sda2 /mnt
$ sudo mount /dev/sda1 /mnt/boot/efi
$ for fs in dev proc sys run; do sudo mount --bind /$fs /mnt/$fs; done
$ sudo chroot /mnt

# Reinstall GRUB and recreate boot entry
$ grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu
$ update-grub

# Or just recreate the boot entry:
$ efibootmgr --create --disk /dev/sda --part 1 --loader /EFI/ubuntu/shimx64.efi --label "Ubuntu"


Kernel Command Line Parameter Mistakes

The footgun: Adding kernel parameters to /etc/default/grub with typos or wrong values, then rebooting.

# Typo in root device:
GRUB_CMDLINE_LINUX="root=/dev/sda3"    # Should be sda2
# System can't find root -> kernel panic

# Setting console to a non-existent serial port:
GRUB_CMDLINE_LINUX="console=ttyS1,115200"
# All boot output goes to a serial port that doesn't exist
# Console is blank, but system might actually be running

# Accidentally adding "quiet" when debugging boot issues:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
# You're trying to see boot messages but they're hidden

Prevention:

# Test parameter changes by editing GRUB at boot time first:
# Press 'e' at GRUB menu, modify the linux line, Ctrl+X to boot
# If it works, THEN make the change permanent in /etc/default/grub

# Always regenerate GRUB config after changes:
$ sudo update-grub

# Verify the generated config carries your parameters. GRUB_CMDLINE_* is
# expanded into the kernel lines, so grep for those, not for "CMDLINE":
$ grep vmlinuz /boot/grub/grub.cfg | head -5

# After rebooting, confirm what the kernel actually received:
$ cat /proc/cmdline


Reboot Loops from systemd Unit Failures

The footgun: A systemd unit configured with OnFailure=reboot.target or FailureAction=reboot that keeps failing.

# A "critical" service configured to reboot on failure:
[Unit]
Description=Critical Service
OnFailure=reboot.target

[Service]
ExecStart=/opt/broken-app/start.sh
# The app crashes immediately every time
# The system reboots, app crashes, system reboots, app crashes...
# Infinite reboot loop

Prevention:

# Use restart limits instead of reboot triggers (on current systemd the
# StartLimit* options belong in [Unit]):
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5s
# After 5 failed starts within 300 seconds, systemd gives up (no reboot)

# If you must have reboot-on-failure, rate-limit the restarts so a
# crash-looping app fails for good before OnFailure= ever fires:
[Unit]
OnFailure=reboot.target
StartLimitIntervalSec=600
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=60s
# Three retries a minute apart; only when the unit finally enters the
# failed state does OnFailure= trigger the reboot
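It is also worth auditing a fleet for units that already carry this trap. A sketch (the directories passed in the usage line are the usual unit-file locations; adjust as needed):

```shell
# find_reboot_traps: list unit files under the given directories that
# reboot the machine when the unit fails.
find_reboot_traps() {
    grep -rlE 'OnFailure=.*reboot\.target|FailureAction=reboot' "$@" 2>/dev/null
}

# Usage:
#   find_reboot_traps /etc/systemd/system /usr/lib/systemd/system
```

Anything it reports should get the restart-limit treatment above before its next failure, not after.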

Recovery from a reboot loop:

1. At the GRUB menu, press e and add systemd.unit=rescue.target to the kernel line
2. Boot to rescue mode
3. Disable the problematic service: systemctl disable broken-app.service
4. Reboot normally
5. Fix the service, then re-enable it