Linux Boot Process — Footguns & Pitfalls¶
These are the mistakes that turn a simple reboot into a multi-hour emergency. Every one of these has bricked production servers. Learn them so you don't repeat them.
Editing /etc/fstab Without Testing¶
The footgun: Changing /etc/fstab and rebooting without verifying the changes.
# Someone adds a new mount to fstab:
UUID=wrong-uuid-here /mnt/data ext4 defaults 0 2
# They reboot. The UUID doesn't exist (typo). The system drops to emergency mode
# because systemd can't mount the filesystem and the default is to fail.
Why it's devastating:
- On remote servers, you can't access the GRUB menu or emergency shell without IPMI/console access
- Cloud instances may require detaching the root volume and reattaching it to another instance to fix
- The entire team is blocked while someone scrambles for console access
Prevention — always test before rebooting:
# After editing fstab, verify ALL entries:
$ sudo mount -a
# If this succeeds with no errors, it's safe to reboot
# Also validate the syntax:
$ sudo findmnt --verify --tab-file /etc/fstab
# For entries that depend on network (NFS, iSCSI), use nofail:
UUID=abc123 /mnt/nfs nfs defaults,nofail,_netdev 0 0
# nofail means: if mount fails, continue booting anyway
# _netdev means: wait for network before trying to mount
The nofail option is your safety net. For any non-critical mount, add nofail so the system boots even if the mount fails.
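To catch the missing-nofail case in review rather than at reboot, a small check can scan an fstab-style file for network filesystems that lack the option. This is a sketch — the function name and the list of filesystem types it treats as network-dependent are illustrative assumptions:

```shell
#!/bin/sh
# check_fstab_nofail: warn about network mounts in an fstab-style file
# that are missing the nofail option. Takes the fstab path as $1 so it
# can be dry-run against a copy before touching the real file.
check_fstab_nofail() {
    fstab="$1"
    awk '
        # Skip comments and blank lines
        /^[[:space:]]*(#|$)/ { next }
        # Field 3 is the filesystem type, field 4 the mount options
        $3 ~ /^(nfs|nfs4|cifs|iscsi)$/ && $4 !~ /(^|,)nofail(,|$)/ {
            print "WARNING: " $2 " (" $3 ") has no nofail option"
        }
    ' "$fstab"
}
```

Run it as `check_fstab_nofail /etc/fstab` alongside `findmnt --verify`; the two checks catch different classes of mistake (risky options vs. broken syntax).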
Deleting Old Kernels Without Keeping a Fallback¶
The footgun: Aggressively removing old kernels to save space in /boot, leaving only one kernel. Then that kernel has a problem.
# "I'll clean up /boot"
$ sudo apt-get purge linux-image-5.15.0-{85,86,87,88,89,90}-generic
$ sudo apt-get purge linux-image-5.15.0-91-generic # This was the only working one
# Now the only installed kernel is 5.15.0-92-generic, which has a driver regression
# that prevents your RAID controller from being detected.
# System won't boot. No fallback kernel in GRUB.
Prevention:
# Always keep at least TWO kernels: current + one known-good fallback
$ dpkg --list 'linux-image-*' | grep '^ii'
# Make sure there are at least 2 entries
# On RHEL, set the retention limit:
# /etc/dnf/dnf.conf
# Keep 3 kernels installed (dnf does not reliably parse inline comments)
installonly_limit=3
# On Debian, autoremove handles this:
$ sudo apt-get autoremove # Safely removes old kernels, keeps current + one
# After a kernel update, TEST the new kernel before removing the old one:
# 1. Reboot into new kernel
# 2. Verify everything works (storage, networking, GPU, etc.)
# 3. Only THEN remove old kernels
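The "at least two kernels" rule can be scripted. A hedged sketch — the function name is made up, and it only parses dpkg-style `ii` lines fed on stdin, so it can be tested against captured output:

```shell
#!/bin/sh
# kernel_fallback_check: given "dpkg --list 'linux-image-*'"-style output
# on stdin, count installed (ii) kernel image packages and report whether
# a fallback kernel is present.
kernel_fallback_check() {
    count=$(grep -c '^ii[[:space:]]\+linux-image-[0-9]' || true)
    if [ "$count" -lt 2 ]; then
        echo "UNSAFE: only $count kernel(s) installed - keep a fallback"
    else
        echo "OK: $count kernels installed"
    fi
}
```

Typical usage: `dpkg --list 'linux-image-*' | kernel_fallback_check`, wired into whatever runs before your reboot procedure.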
Filling the /boot Partition¶
The footgun: Not monitoring the /boot partition size. Kernel updates accumulate until /boot is full, then apt upgrade or yum update fails, sometimes leaving a half-installed kernel.
$ df -h /boot
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 477M 475M 0 100% /boot
$ sudo apt-get upgrade
E: Could not write to /boot/initrd.img-5.15.0-93-generic - No space left on device
dpkg: error processing package linux-image-5.15.0-93-generic (--configure):
installed linux-image-5.15.0-93-generic package post-installation script subprocess returned error exit status 1
Why it happens:
- /boot is often a small separate partition (256-512 MB)
- Each kernel takes ~50-100 MB (vmlinuz + initrd + System.map + config)
- After 5-6 kernel updates without cleanup, it fills up
- Automated updates might stop working silently
Prevention:
# Monitor /boot usage
$ df -h /boot
# Set up automatic kernel cleanup
# Debian/Ubuntu: ensure unattended-upgrades removes old kernels
$ cat /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-New-Unused-Dependencies "true";
# Or create a simple cron check:
$ cat > /etc/cron.daily/check-boot-space << 'EOF'
#!/bin/bash
USAGE=$(df /boot --output=pcent | tail -1 | tr -d '% ')
if [ "$USAGE" -gt 80 ]; then
echo "/boot is ${USAGE}% full on $(hostname)" | mail -s "ALERT: /boot space" ops@example.com
fi
EOF
$ chmod +x /etc/cron.daily/check-boot-space
Recovery when /boot is full: See the Street Ops section on filling /boot partition.
Bad GRUB Config Without Regenerating¶
The footgun: Manually editing /boot/grub/grub.cfg instead of editing /etc/default/grub and running update-grub.
# Someone hand-edits grub.cfg:
$ sudo vim /boot/grub/grub.cfg
# They make a change that seems fine...
# Next kernel update runs update-grub, which OVERWRITES grub.cfg
# from the template files in /etc/grub.d/ and settings in /etc/default/grub
# The manual edit is gone. If it was important, things break.
# Or worse: the manual edit introduces a syntax error
# and now GRUB can't parse its config at all
The correct approach:
# 1. Edit settings:
$ sudo vim /etc/default/grub
# 2. For custom entries, create a script in /etc/grub.d/:
$ sudo vim /etc/grub.d/40_custom
# 3. Regenerate:
$ sudo update-grub # Debian/Ubuntu
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg # RHEL
# 4. Verify the generated config looks right:
$ grep menuentry /boot/grub/grub.cfg
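Step 4 can be made slightly stricter than eyeballing: count the top-level menuentry lines in the generated config and treat zero as a red flag. A minimal sketch (function name is illustrative; it assumes top-level entries start at column 0, as update-grub emits them):

```shell
#!/bin/sh
# grub_entry_count: count top-level "menuentry" lines in a grub.cfg-style
# file passed as $1. Zero entries means GRUB has nothing to offer at boot.
# Entries nested inside "submenu" blocks are indented, so they don't count.
grub_entry_count() {
    grep -c '^menuentry ' "$1" || true
}
```

Run `grub_entry_count /boot/grub/grub.cfg` right after `update-grub`; if it prints 0, do not reboot.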
initramfs Missing Critical Drivers¶
The footgun: Updating the kernel or storage configuration without regenerating initramfs, or regenerating it without the necessary drivers.
# Scenario 1: You move root to a new storage controller (e.g., from SATA to NVMe)
# but forget to rebuild initramfs with the NVMe driver.
# Kernel boots, can't find root filesystem -> panic
# Scenario 2: You update dracut config to exclude "unnecessary" modules
# /etc/dracut.conf.d/slim.conf
omit_drivers+=" megaraid_sas mpt3sas "
# Rebuild initramfs, reboot... kernel can't see your RAID array
# Scenario 3: You switch root filesystem from ext4 to XFS
# but initramfs still only has ext4 module
Prevention:
# After any storage change, rebuild initramfs:
$ sudo update-initramfs -u # Debian
$ sudo dracut -f # RHEL
# Verify the initramfs has the drivers you need:
$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -i "nvme\|megaraid\|xfs"
# Before rebooting into a new kernel, verify initramfs exists and is non-empty:
$ ls -la /boot/initrd.img-5.15.0-93-generic
-rw-r--r-- 1 root root 67108864 Mar 19 10:00 /boot/initrd.img-5.15.0-93-generic
# If the file is tiny (< 1 MB) or missing, DO NOT REBOOT
# Dracut: check what modules are included:
$ lsinitrd /boot/initramfs-$(uname -r).img | head -30
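The driver check above can be wrapped so it fails loudly instead of relying on someone reading grep output. A sketch, assuming an lsinitramfs/lsinitrd-style file listing on stdin and that module files appear as `<name>.ko` (possibly compressed, e.g. `.ko.zst`):

```shell
#!/bin/sh
# initramfs_has_module: given an lsinitramfs/lsinitrd-style listing on
# stdin, report whether kernel module $1 is present. Matching "/<name>.ko"
# also catches compressed modules like nvme.ko.zst.
initramfs_has_module() {
    if grep -q "/${1}\.ko"; then
        echo "found: $1"
    else
        echo "MISSING: $1 - do not reboot"
    fi
}
```

Typical usage: `lsinitramfs /boot/initrd.img-$(uname -r) | initramfs_has_module nvme` — one call per driver your root device needs.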
Changing the Default systemd Target Without Understanding Dependencies¶
The footgun: Setting the default target to something that doesn't include networking or SSH.
# "I don't need a GUI on this server"
$ sudo systemctl set-default multi-user.target
# This is usually fine — multi-user.target includes networking and SSH
# But what if someone does:
$ sudo systemctl set-default rescue.target
# Rescue mode requires physical console access
# Remote servers are now unreachable after reboot
# Or:
$ sudo systemctl isolate emergency.target
# Emergency mode: minimal services, no networking, no SSH
# On a remote server, this is instant loss of access
Prevention:
# For servers, the correct target is almost always multi-user.target
$ sudo systemctl set-default multi-user.target
# Before isolating to a different target, understand what it includes:
$ systemctl list-dependencies rescue.target
# Note: no network.target, no sshd.service
# If you need to change targets for maintenance on a remote server,
# use a timed revert:
$ sudo systemctl set-default rescue.target
$ echo "systemctl set-default multi-user.target" | sudo at now + 10 minutes
# If something goes wrong, the system reverts in 10 minutes
$ sudo reboot
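Checking a target for sshd can also be scripted before you switch. A sketch that reads `systemctl list-dependencies <target>`-style output on stdin — the function name is illustrative, and matching the unit name anywhere on the line tolerates the tree-drawing glyphs in the real output:

```shell
#!/bin/sh
# target_keeps_ssh: given "systemctl list-dependencies <target>"-style
# output on stdin, report whether sshd.service is pulled in by the target.
target_keeps_ssh() {
    if grep -q 'sshd\.service'; then
        echo "OK: target pulls in sshd"
    else
        echo "DANGER: target does not include sshd - remote access will drop"
    fi
}
```

Typical usage: `systemctl list-dependencies rescue.target | target_keeps_ssh` — on a remote box, a DANGER answer means don't isolate or set-default to that target.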
Disabling Services You Don't Understand¶
The footgun: Disabling systemd services to "speed up boot" or "harden the system" without understanding their dependencies.
# "I don't need this, it slows down boot"
$ sudo systemctl disable systemd-udevd.service
# Hardware detection won't work. Enjoy no storage/network.
$ sudo systemctl disable systemd-journald.service
# All logging stops. You'll have zero diagnostics when things break.
$ sudo systemctl disable dbus.service
# D-Bus is the system message bus. NetworkManager, systemd, polkit — all broken.
Safe boot optimization:
# First, identify what's actually slow:
$ systemd-analyze blame | head -15
# Only disable services you understand and know aren't needed:
$ sudo systemctl disable snapd.service # If you don't use snaps
$ sudo systemctl disable ModemManager.service # Servers don't have modems
$ sudo systemctl disable bluetooth.service # Servers don't need bluetooth
$ sudo systemctl disable cups.service # Print service on a server? No.
# NEVER disable without checking dependencies:
$ systemctl list-dependencies --reverse systemd-udevd.service
# Shows what depends on this service
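The reverse-dependency check can be turned into a go/no-go answer. A sketch, assuming `systemctl list-dependencies --reverse <unit>` output on stdin with the unit itself on the first line (as systemctl prints it):

```shell
#!/bin/sh
# safe_to_disable: given "systemctl list-dependencies --reverse <unit>"
# output on stdin (first line is the unit itself), decide whether any
# other unit depends on it. Purely illustrative; real output has tree
# glyphs, which the .service/.target match tolerates.
safe_to_disable() {
    deps=$(tail -n +2 | grep -c '\.service\|\.target' || true)
    if [ "$deps" -gt 0 ]; then
        echo "KEEP: $deps unit(s) depend on this service"
    else
        echo "ok to disable"
    fi
}
```

Typical usage: `systemctl list-dependencies --reverse dbus.service | safe_to_disable` — anything answering KEEP stays enabled until you understand every dependent.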
UEFI Boot Entry Corruption¶
The footgun: Manually editing UEFI NVRAM variables or the ESP without understanding the boot entry structure.
# Removing all boot entries:
$ sudo efibootmgr -b 0001 -B
$ sudo efibootmgr -b 0002 -B
# Now there are no boot entries. Firmware has nothing to boot.
# Or deleting files from ESP:
$ sudo rm -rf /boot/efi/EFI/ubuntu/
# UEFI firmware can't find the bootloader.
Recovery:
# Boot from live USB, then:
$ sudo mount /dev/sda2 /mnt          # root filesystem first
$ sudo mount /dev/sda1 /mnt/boot/efi # then the ESP on top of it
$ for fs in dev proc sys run; do sudo mount --bind /$fs /mnt/$fs; done
$ sudo chroot /mnt
# Reinstall GRUB and recreate boot entry
$ grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu
$ update-grub
# Or just recreate the boot entry:
$ efibootmgr --create --disk /dev/sda --part 1 --loader /EFI/ubuntu/shimx64.efi --label "Ubuntu"
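Cheap insurance before running any destructive efibootmgr command: snapshot the current entries so they can be recreated by hand. A minimal sketch — the backup path is whatever you pass in, and the fallback note is an assumption for machines without EFI variables:

```shell
#!/bin/sh
# backup_boot_entries: save the verbose UEFI boot entry listing to a file
# ($1) before modifying entries. If efibootmgr is unavailable or EFI
# variables can't be read, record that instead of leaving an empty file.
backup_boot_entries() {
    efibootmgr -v > "$1" 2>/dev/null || echo "efibootmgr unavailable" > "$1"
}
```

Run `backup_boot_entries /root/efiboot-$(date +%F).txt` first; the saved `-v` output contains the disk, partition, and loader path needed to rebuild an entry with `efibootmgr --create`.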
Kernel Command Line Parameter Mistakes¶
The footgun: Adding kernel parameters to /etc/default/grub with typos or wrong values, then rebooting.
# Typo in root device:
GRUB_CMDLINE_LINUX="root=/dev/sda3" # Should be sda2
# System can't find root -> kernel panic
# Setting console to a non-existent serial port:
GRUB_CMDLINE_LINUX="console=ttyS1,115200"
# All boot output goes to a serial port that doesn't exist
# Console is blank, but system might actually be running
# Accidentally adding "quiet" when debugging boot issues:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
# You're trying to see boot messages but they're hidden
Prevention:
# Test parameter changes by editing GRUB at boot time first:
# Press 'e' at GRUB menu, modify the linux line, Ctrl+X to boot
# If it works, THEN make the change permanent in /etc/default/grub
# Always regenerate GRUB config after changes:
$ sudo update-grub
# Verify the generated config has your changes:
$ grep CMDLINE /boot/grub/grub.cfg | head -5
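The "verify the generated config" step can be automated: check that every token you set in GRUB_CMDLINE actually landed in grub.cfg, which catches a forgotten update-grub. A sketch (function name is illustrative; plain substring matching is good enough for this purpose):

```shell
#!/bin/sh
# cmdline_in_cfg: check that every whitespace-separated token in a
# GRUB_CMDLINE string ($1) appears somewhere in a generated
# grub.cfg-style file ($2). Prints only the tokens that are missing.
cmdline_in_cfg() {
    for tok in $1; do
        if ! grep -q -- "$tok" "$2"; then
            echo "MISSING from grub.cfg: $tok"
        fi
    done
}
```

Typical usage: `cmdline_in_cfg "root=/dev/sda2 quiet" /boot/grub/grub.cfg` — any MISSING line means the config was not regenerated after your edit.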
Reboot Loops from systemd Unit Failures¶
The footgun: A systemd unit configured with OnFailure=reboot.target or FailureAction=reboot that keeps failing.
# A "critical" service configured to reboot on failure:
[Unit]
Description=Critical Service
OnFailure=reboot.target
[Service]
ExecStart=/opt/broken-app/start.sh
# The app crashes immediately every time
# The system reboots, app crashes, system reboots, app crashes...
# Infinite reboot loop
Prevention:
# Use restart limits instead of reboot triggers.
# Note: StartLimitIntervalSec and StartLimitBurst are [Unit] options on
# current systemd (they are accepted in [Service] only for backwards
# compatibility).
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Restart=on-failure
RestartSec=5s
# After 5 failures in 300 seconds, give up (don't reboot)
# If you must have reboot-on-failure, add a delay and limit:
[Unit]
OnFailure=reboot.target
StartLimitIntervalSec=300
StartLimitBurst=3
StartLimitAction=none # Don't take action on start limit, just stop trying
Recovery from reboot loop:
1. At GRUB menu, press e and add systemd.unit=rescue.target to kernel line
2. Boot to rescue mode
3. Disable the problematic service: systemctl disable broken-app.service
4. Reboot normally
5. Fix the service, then re-enable