
Linux Boot Process — Footguns & Pitfalls

These are the mistakes that turn a simple reboot into a multi-hour emergency. Every one of these has bricked production servers. Learn them so you don't repeat them.


Editing /etc/fstab Without Testing

The footgun: Changing /etc/fstab and rebooting without verifying the changes.

# Someone adds a new mount to fstab:
UUID=wrong-uuid-here    /mnt/data    ext4    defaults    0    2

# They reboot. The UUID doesn't exist (typo). The system drops to emergency mode
# because systemd can't mount the filesystem and the default is to fail.

Why it's devastating:

- On remote servers, you can't reach the GRUB menu or emergency shell without IPMI/console access
- Cloud instances may require detaching the root volume and attaching it to another instance to fix
- The entire team is blocked while someone scrambles for console access

Prevention — always test before rebooting:

# After editing fstab, verify ALL entries:
$ sudo mount -a
# If this succeeds with no errors, it's safe to reboot

# Also validate the syntax:
$ sudo findmnt --verify --tab-file /etc/fstab

# For entries that depend on network (NFS, iSCSI), use nofail:
UUID=abc123    /mnt/nfs    nfs    defaults,nofail,_netdev    0    0
# nofail means: if mount fails, continue booting anyway
# _netdev means: wait for network before trying to mount

The nofail option is your safety net. For any non-critical mount, add nofail so the system boots even if the mount fails.
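Beyond nofail, systemd honors per-mount options directly in fstab that bound how long a missing device can stall the boot. A sketch (x-systemd.device-timeout and x-systemd.automount are standard systemd.mount options; the UUID and mount point are placeholders):

```shell
# Cap how long systemd waits for the device before giving up (default is 90s):
UUID=abc123    /mnt/data    ext4    defaults,nofail,x-systemd.device-timeout=10s    0    2

# Or skip the mount at boot entirely and mount lazily on first access:
UUID=abc123    /mnt/data    ext4    defaults,nofail,x-systemd.automount    0    2
```

Either way, a dead disk costs you seconds at boot instead of a drop into emergency mode.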


Deleting Old Kernels Without Keeping a Fallback

The footgun: Aggressively removing old kernels to save space in /boot, leaving only one kernel. Then that kernel has a problem.

# "I'll clean up /boot"
$ sudo apt-get purge linux-image-5.15.0-{85,86,87,88,89,90}-generic
$ sudo apt-get purge linux-image-5.15.0-91-generic   # This was the only working one

# Now the only installed kernel is 5.15.0-92-generic, which has a driver regression
# that prevents your RAID controller from being detected.
# System won't boot. No fallback kernel in GRUB.

Prevention:

# Always keep at least TWO kernels: current + one known-good fallback
$ dpkg --list 'linux-image-*' | grep '^ii'
# Make sure there are at least 2 entries

# On RHEL, set the retention limit:
# /etc/dnf/dnf.conf
installonly_limit=3    # Keep 3 kernels

# On Debian, autoremove handles this:
$ sudo apt-get autoremove    # Safely removes old kernels, keeps current + one

# After a kernel update, TEST the new kernel before removing the old one:
# 1. Reboot into new kernel
# 2. Verify everything works (storage, networking, GPU, etc.)
# 3. Only THEN remove old kernels
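The "at least two kernels" rule is easy to script. A minimal sketch that counts installed linux-image packages from `dpkg --list` output (the package-name pattern is an assumption for Debian/Ubuntu signed kernel images; adjust for your distro):

```shell
# count_installed_kernels: count "ii" (installed) versioned linux-image
# packages from `dpkg --list` output on stdin. The [0-9] excludes
# meta-packages like linux-image-generic.
count_installed_kernels() {
    grep -c '^ii[[:space:]]\+linux-image-[0-9]'
}

# Usage on a real system:
#   n=$(dpkg --list 'linux-image-*' | count_installed_kernels)
#   [ "$n" -ge 2 ] || echo "WARNING: no fallback kernel installed!" >&2
```

Wire the warning into your monitoring so a missing fallback is caught before the next reboot, not after.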


Filling the /boot Partition

The footgun: Not monitoring the /boot partition size. Kernel updates accumulate until /boot is full, then apt upgrade or yum update fails, sometimes leaving a half-installed kernel.

$ df -h /boot
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       477M  475M     0 100% /boot

$ sudo apt-get upgrade
E: Could not write to /boot/initrd.img-5.15.0-93-generic - No space left on device
dpkg: error processing package linux-image-5.15.0-93-generic (--configure):
 installed linux-image-5.15.0-93-generic package post-installation script subprocess returned error exit status 1

Why it happens:

- /boot is often a small separate partition (256-512 MB)
- Each kernel takes roughly 50-100 MB (vmlinuz + initrd + System.map + config)
- After 5-6 kernel updates without cleanup, it fills up
- Automated updates may then start failing silently

Prevention:

# Monitor /boot usage
$ df -h /boot

# Set up automatic kernel cleanup
# Debian/Ubuntu: ensure unattended-upgrades removes old kernels
$ cat /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-New-Unused-Dependencies "true";

# Or create a simple cron check:
$ cat > /etc/cron.daily/check-boot-space << 'EOF'
#!/bin/bash
USAGE=$(df /boot --output=pcent | tail -1 | tr -d '% ')
if [ "$USAGE" -gt 80 ]; then
    echo "/boot is ${USAGE}% full on $(hostname)" | mail -s "ALERT: /boot space" ops@example.com
fi
EOF
$ chmod +x /etc/cron.daily/check-boot-space
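On hosts that standardize on systemd timers rather than cron, the same daily check can run as a timer pair. A sketch (the unit names and script path are placeholders; the script body would be the cron check above):

```ini
# /etc/systemd/system/check-boot-space.service
[Unit]
Description=Alert when /boot is nearly full

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/check-boot-space

# /etc/systemd/system/check-boot-space.timer
[Unit]
Description=Daily /boot space check

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl daemon-reload && systemctl enable --now check-boot-space.timer`. Persistent=true runs a missed check after downtime, which cron.daily does not guarantee.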

Recovery when /boot is full: See the Street Ops section on filling /boot partition.


Bad GRUB Config Without Regenerating

The footgun: Manually editing /boot/grub/grub.cfg instead of editing /etc/default/grub and running update-grub.

# Someone hand-edits grub.cfg:
$ sudo vim /boot/grub/grub.cfg
# They make a change that seems fine...

# Next kernel update runs update-grub, which OVERWRITES grub.cfg
# from the template files in /etc/grub.d/ and settings in /etc/default/grub
# The manual edit is gone. If it was important, things break.

# Or worse: the manual edit introduces a syntax error
# and now GRUB can't parse its config at all

The correct approach:

# 1. Edit settings:
$ sudo vim /etc/default/grub

# 2. For custom entries, create a script in /etc/grub.d/:
$ sudo vim /etc/grub.d/40_custom

# 3. Regenerate:
$ sudo update-grub              # Debian/Ubuntu
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL

# 4. Verify the generated config looks right:
$ grep menuentry /boot/grub/grub.cfg
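A custom entry goes below the `exec tail` line that ships with /etc/grub.d/40_custom, so update-grub copies it verbatim into grub.cfg. A sketch (the kernel version and filesystem UUID are placeholders):

```shell
#!/bin/sh
exec tail -n +3 $0
# Lines below this one are copied verbatim into grub.cfg by update-grub.
menuentry "Fallback kernel 5.15.0-91" {
    search --no-floppy --fs-uuid --set=root abc123
    linux /boot/vmlinuz-5.15.0-91-generic root=UUID=abc123 ro
    initrd /boot/initrd.img-5.15.0-91-generic
}
```

A pinned known-good entry like this survives regeneration, which a hand-edit to grub.cfg never does.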


initramfs Missing Critical Drivers

The footgun: Updating the kernel or storage configuration without regenerating initramfs, or regenerating it without the necessary drivers.

# Scenario 1: You move root to a new storage controller (e.g., from SATA to NVMe)
# but forget to rebuild initramfs with the NVMe driver.
# Kernel boots, can't find root filesystem -> panic

# Scenario 2: You update dracut config to exclude "unnecessary" modules
# /etc/dracut.conf.d/slim.conf
omit_drivers+=" megaraid_sas mpt3sas "
# Rebuild initramfs, reboot... kernel can't see your RAID array

# Scenario 3: You switch root filesystem from ext4 to XFS
# but initramfs still only has ext4 module

Prevention:

# After any storage change, rebuild initramfs:
$ sudo update-initramfs -u      # Debian
$ sudo dracut -f                # RHEL

# Verify the initramfs has the drivers you need:
$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -i "nvme\|megaraid\|xfs"

# Before rebooting into a new kernel, verify initramfs exists and is non-empty:
$ ls -la /boot/initrd.img-5.15.0-93-generic
-rw-r--r-- 1 root root 67108864 Mar 19 10:00 /boot/initrd.img-5.15.0-93-generic
# If the file is tiny (< 1 MB) or missing, DO NOT REBOOT

# Dracut: check what modules are included:
$ lsinitrd /boot/initramfs-$(uname -r).img | head -30
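The driver check can be turned into a refusal gate before reboot. A sketch that scans an initramfs file listing (as produced by lsinitramfs or lsinitrd) for a required kernel module, allowing for compressed .ko files:

```shell
# has_module: succeed if the named module appears as a .ko (possibly
# .gz/.xz/.zst compressed) in an initramfs listing read from stdin.
has_module() {
    grep -q "/${1}\.ko\(\.\(gz\|xz\|zst\)\)\?$"
}

# Usage on a real system:
#   lsinitramfs /boot/initrd.img-$(uname -r) | has_module nvme \
#       || echo "DO NOT REBOOT: nvme driver missing from initramfs" >&2
```

Run it for every driver your root path needs (storage controller, filesystem, dm/md modules) after any rebuild.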


Changing the Default systemd Target Without Understanding Dependencies

The footgun: Setting the default target to something that doesn't include networking or SSH.

# "I don't need a GUI on this server"
$ sudo systemctl set-default multi-user.target
# This is usually fine — multi-user.target includes networking and SSH

# But what if someone does:
$ sudo systemctl set-default rescue.target
# Rescue mode requires physical console access
# Remote servers are now unreachable after reboot

# Or:
$ sudo systemctl isolate emergency.target
# Emergency mode: minimal services, no networking, no SSH
# On a remote server, this is instant loss of access

Prevention:

# For servers, the correct target is almost always multi-user.target
$ sudo systemctl set-default multi-user.target

# Before isolating to a different target, understand what it includes:
$ systemctl list-dependencies rescue.target
# Note: no network.target, no sshd.service

# If you need to change targets for maintenance on a remote server,
# schedule a timed revert first:
$ sudo systemctl set-default rescue.target
$ echo "systemctl set-default multi-user.target" | at now + 10 minutes
# Caveat: at jobs run only while atd is up, and rescue.target does not
# start atd — so this guards against forgetting to revert, not against
# losing access in rescue mode itself. Have console access lined up.
$ sudo reboot


Disabling Services You Don't Understand

The footgun: Disabling systemd services to "speed up boot" or "harden the system" without understanding their dependencies.

# "I don't need this, it slows down boot"
$ sudo systemctl disable systemd-udevd.service
# Hardware detection won't work. Enjoy no storage/network.

$ sudo systemctl disable systemd-journald.service
# All logging stops. You'll have zero diagnostics when things break.

$ sudo systemctl disable dbus.service
# D-Bus is the system message bus. NetworkManager, systemd, polkit — all broken.

Safe boot optimization:

# First, identify what's actually slow:
$ systemd-analyze blame | head -15

# Only disable services you understand and know aren't needed:
$ sudo systemctl disable snapd.service          # If you don't use snaps
$ sudo systemctl disable ModemManager.service    # Servers don't have modems
$ sudo systemctl disable bluetooth.service       # Servers don't need bluetooth
$ sudo systemctl disable cups.service            # Print service on a server? No.

# NEVER disable without checking dependencies:
$ systemctl list-dependencies --reverse systemd-udevd.service
# Shows what depends on this service


UEFI Boot Entry Corruption

The footgun: Manually editing UEFI NVRAM variables or the ESP without understanding the boot entry structure.

# Removing all boot entries:
$ sudo efibootmgr -b 0001 -B
$ sudo efibootmgr -b 0002 -B
# Now there are no boot entries. Firmware has nothing to boot.

# Or deleting files from ESP:
$ sudo rm -rf /boot/efi/EFI/ubuntu/
# UEFI firmware can't find the bootloader.

Recovery:

# Boot from live USB, then mount root FIRST, then the ESP inside it:
$ sudo mount /dev/sda2 /mnt
$ sudo mount /dev/sda1 /mnt/boot/efi
$ for fs in dev proc sys run; do sudo mount --bind /$fs /mnt/$fs; done
$ sudo chroot /mnt

# Reinstall GRUB and recreate boot entry
$ grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu
$ update-grub

# Or just recreate the boot entry:
$ efibootmgr --create --disk /dev/sda --part 1 --loader /EFI/ubuntu/shimx64.efi --label "Ubuntu"


Kernel Command Line Parameter Mistakes

The footgun: Adding kernel parameters to /etc/default/grub with typos or wrong values, then rebooting.

# Typo in root device:
GRUB_CMDLINE_LINUX="root=/dev/sda3"    # Should be sda2
# System can't find root -> kernel panic

# Setting console to a non-existent serial port:
GRUB_CMDLINE_LINUX="console=ttyS1,115200"
# All boot output goes to a serial port that doesn't exist
# Console is blank, but system might actually be running

# Accidentally adding "quiet" when debugging boot issues:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
# You're trying to see boot messages but they're hidden

Prevention:

# Test parameter changes by editing GRUB at boot time first:
# Press 'e' at GRUB menu, modify the linux line, Ctrl+X to boot
# If it works, THEN make the change permanent in /etc/default/grub

# Always regenerate GRUB config after changes:
$ sudo update-grub

# Verify the generated config carries your parameters. GRUB_CMDLINE_* is
# expanded into the kernel lines, so grep for those, not for "CMDLINE":
$ grep vmlinuz /boot/grub/grub.cfg | head -5

# After rebooting, confirm what the kernel actually received:
$ cat /proc/cmdline


Reboot Loops from systemd Unit Failures

The footgun: A systemd unit configured with OnFailure=reboot.target or FailureAction=reboot that keeps failing.

# A "critical" service configured to reboot on failure:
[Unit]
Description=Critical Service
OnFailure=reboot.target

[Service]
ExecStart=/opt/broken-app/start.sh
# The app crashes immediately every time
# The system reboots, app crashes, system reboots, app crashes...
# Infinite reboot loop

Prevention:

# Use restart limits instead of reboot triggers (on current systemd the
# StartLimit* options belong in [Unit]):
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5s
# After 5 failed starts within 300 seconds, systemd gives up (no reboot)

# If you must have reboot-on-failure, rate-limit the restarts so a
# crash-looping app fails for good before OnFailure= ever fires:
[Unit]
OnFailure=reboot.target
StartLimitIntervalSec=600
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=60s
# Three retries a minute apart; only when the unit finally enters the
# failed state does OnFailure= trigger the reboot
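It is also worth auditing a fleet for units that already carry this trap. A sketch (the directories passed in the usage line are the usual unit-file locations; adjust as needed):

```shell
# find_reboot_traps: list unit files under the given directories that
# reboot the machine when the unit fails.
find_reboot_traps() {
    grep -rlE 'OnFailure=.*reboot\.target|FailureAction=reboot' "$@" 2>/dev/null
}

# Usage:
#   find_reboot_traps /etc/systemd/system /usr/lib/systemd/system
```

Anything it reports should get the restart-limit treatment above before its next failure, not after.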

Recovery from a reboot loop:

1. At the GRUB menu, press e and add systemd.unit=rescue.target to the kernel line
2. Boot to rescue mode
3. Disable the problematic service: systemctl disable broken-app.service
4. Reboot normally
5. Fix the service, then re-enable it