Homelab Footguns
Mistakes that brick your lab, lose your data, or turn a learning exercise into a weekend recovery project.
1. Running Proxmox on a single disk with no backups
You install Proxmox on one SSD. You create 15 VMs over three months. The SSD dies. Everything is gone — VMs, configs, your entire lab setup. "I'll set up backups later" cost you three months of work.
Fix: Use ZFS mirrors (minimum two disks) from day one. Back up VM configs to a USB drive or remote NFS weekly. Export your cloud-init templates. Store your Ansible playbooks in Git. If you can't rebuild from scratch in an hour, your backup strategy is broken.
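The weekly config backup can be sketched as a small POSIX shell script. This is a generic sketch, not an official Proxmox tool: the `/mnt/backup` destination, the `pve-config-` naming, and the 14-archive retention are all assumptions to adapt to your lab.

```shell
#!/usr/bin/env sh
# Sketch: back up a config directory to dated tarballs and prune old copies.
# On a real Proxmox host you would cron: backup_config /etc/pve /mnt/backup
set -eu

backup_config() {
    src="$1" dest="$2"
    mkdir -p "$dest"
    stamp="$(date +%F)"
    # Archive the directory relative to its parent so paths stay readable
    tar czf "$dest/pve-config-$stamp.tar.gz" \
        -C "$(dirname "$src")" "$(basename "$src")"
    # Retention: keep the 14 newest archives, delete the rest (GNU xargs -r)
    ls -1t "$dest"/pve-config-*.tar.gz | tail -n +15 | xargs -r rm -f
}
```

This only captures configs; VM disks themselves still need a real backup job (vzdump or Proxmox Backup Server).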
2. Giving your homelab the same subnet as your home network
You set up Proxmox VMs on 192.168.1.0/24 — the same subnet as your home router. IP conflicts everywhere. Your spouse's laptop gets the same IP as your k3s node. Conflicting ARP replies for the duplicate IPs make everything intermittently unreachable.
Fix: Use a separate subnet for lab traffic. 10.10.20.0/24 is a safe choice. Use VLANs if your switch supports it. At minimum, use a different /24 that your home router won't hand out via DHCP.
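On a Proxmox host, a separate lab subnet with optional VLANs can look like this `/etc/network/interfaces` fragment. The interface name, addresses, and VLAN range are illustrative assumptions, not defaults:

```
# VLAN-aware bridge: host management lives on 10.10.20.0/24,
# VMs can be tagged into their own VLANs per NIC
auto vmbr0
iface vmbr0 inet static
    address 10.10.20.2/24
    gateway 10.10.20.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```

With `bridge-vlan-aware yes`, each VM's NIC gets a VLAN tag in its own config instead of needing one bridge per VLAN.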
3. Port-forwarding SSH directly to the internet
You want to access your lab remotely so you forward port 22 on your home router to your Proxmox host. Within hours, brute-force bots are hammering your SSH. Your auth.log fills up. One weak password and you're owned.
Fix: Use WireGuard VPN. Forward only UDP 51820 to your VPN endpoint. Access everything else through the tunnel. Never expose SSH, Proxmox UI, or any management interface directly to the internet.
Gotcha: WireGuard uses UDP, which some corporate/hotel networks block. Have a Cloudflare Tunnel or Tailscale (which uses DERP relays over HTTPS) as a fallback for restrictive networks. A VPN you can't connect to when traveling is not useful.
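A minimal server-side `wg0.conf` for that setup might look like the sketch below. The keys are placeholders, and the 10.10.30.0/24 tunnel subnet is an assumption chosen to stay clear of the lab and home subnets:

```
[Interface]
Address = 10.10.30.1/24
ListenPort = 51820
PrivateKey = <server-private-key>

[Peer]
# One [Peer] block per device (laptop, phone, ...)
PublicKey = <client-public-key>
AllowedIPs = 10.10.30.2/32
```

On the client side, setting `AllowedIPs = 10.10.20.0/24, 10.10.30.0/24` routes only lab traffic through the tunnel, leaving normal browsing untouched.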
4. Upgrading Proxmox without reading the release notes
Major Proxmox version upgrade (7 to 8). You run apt dist-upgrade without checking compatibility. Your Ceph cluster breaks. Your network config format changed. VMs won't start because the storage backend changed.
Fix: Read the official upgrade guide for every major version. Run pve7to8 --full (or equivalent checker) before upgrading. Snapshot your Proxmox host or back up /etc/pve/ before any major upgrade. Test on one node first in a multi-node cluster.
Remember:
/etc/pve/ is a cluster-aware filesystem (pmxcfs) that stores VM configs, user permissions, and cluster state. Backing up this directory before upgrades captures everything needed to recreate your cluster configuration. tar czf /root/pve-backup-$(date +%F).tar.gz /etc/pve/ takes seconds and saves hours.
5. Using k3s --cluster-reset when you just needed a restart
Your k3s cluster is acting weird. You Google the error, find a forum post saying "just run --cluster-reset." That command wipes the etcd datastore. Your deployments, services, ingress rules, secrets — all gone. The cluster is "fixed" because it's now empty.
Fix: --cluster-reset is a nuclear option. Try systemctl restart k3s first. Check journalctl -u k3s for the actual error. If etcd is corrupted, --cluster-reset is appropriate, but understand that it wipes all cluster state. Back up your manifests in Git so you can reapply.
Debug clue: k3s stores its embedded etcd data at /var/lib/rancher/k3s/server/db/etcd/. If the cluster won't start, check du -sh on this directory — if it's abnormally large (>1GB for a small lab), etcd may need compaction, not a reset. Try k3s etcd-snapshot save before resorting to --cluster-reset.
6. ZFS snapshots piling up silently
You enabled automatic ZFS snapshots (or Proxmox backup jobs that create snapshots). You never prune them. Six months later, your pool is 95% full even though your VMs only use 30% of the raw space. ZFS performance degrades badly above 80% capacity.
Fix: Set up snapshot pruning from day one. Use zfs-auto-snapshot with sane retention (e.g., keep 4 hourly, 7 daily, 4 weekly). Monitor pool usage with alerts at 70% and 80%.
Under the hood: ZFS snapshots are free when created but grow as the original data changes (copy-on-write). Each snapshot holds references to old blocks that can't be freed.
zfs list -t snapshot -o name,used,refer shows how much space each snapshot is preventing from being reclaimed.
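With the zfs-auto-snapshot package, the retention above maps onto the cron files it installs. A sketch (file paths match the Debian/Ubuntu package layout; `//` means "all datasets with com.sun:auto-snapshot=true"):

```
# /etc/cron.hourly/zfs-auto-snapshot
exec zfs-auto-snapshot --quiet --syslog --label=hourly --keep=4 //

# /etc/cron.daily/zfs-auto-snapshot
exec zfs-auto-snapshot --quiet --syslog --label=daily --keep=7 //

# /etc/cron.weekly/zfs-auto-snapshot
exec zfs-auto-snapshot --quiet --syslog --label=weekly --keep=4 //
```

--keep=N prunes the oldest snapshots of that label automatically, so the pool never silently fills with stale snapshots.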
7. Running PiHole as your only DNS with no fallback
PiHole VM is down for maintenance. Every device in your house loses DNS. Your family can't load any websites. You can't even Google the fix because DNS is down.
Fix: Run two PiHole instances (primary and secondary). Configure your DHCP server to hand out both IPs. Alternatively, set your router as secondary DNS with upstream forwarding to 1.1.1.1 as a fallback.
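If your DHCP server runs dnsmasq (as many home routers and OpenWrt do), handing out both resolvers is one line. The two addresses below are example PiHole IPs:

```
# DHCP option 6 (DNS servers): primary and secondary PiHole
dhcp-option=option:dns-server,10.10.20.53,10.10.20.54
```

Clients then fail over to the second resolver on their own when the first PiHole is down for maintenance.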
8. Hardcoding IPs everywhere instead of using DNS
You access Grafana at http://10.10.20.15:3000, Gitea at http://10.10.20.16:3000, ArgoCD at https://10.10.20.17:8080. You change a VM's IP and now half your bookmarks and configs are broken. You can never remember which IP is which.
Fix: Set up local DNS on day one (PiHole custom DNS or a dedicated CoreDNS). Use names: grafana.lab.home, gitea.lab.home. Update DNS when IPs change. Your muscle memory and your configs will thank you.
War story: a common homelab failure. You configure 20 services pointing at 10.10.20.15 for your database. You rebuild the database VM and it gets a new IP. Now you're grepping through every config file, docker-compose file, and bookmark to find all references to the old IP. DNS plus DHCP reservations make IP changes a one-line fix instead of a 20-service outage.
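In PiHole, those custom names live in /etc/pihole/custom.list (one "IP hostname" pair per line; the exact path varies by PiHole version, and these entries are examples matching the IPs above). The same records can be managed under Local DNS in the web UI:

```
10.10.20.15 grafana.lab.home
10.10.20.16 gitea.lab.home
10.10.20.17 argocd.lab.home
```

Rebuild a VM, update one line here, and every bookmark and config that uses the name keeps working.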
9. Not labeling cables and ports in your physical setup
You have 8 Ethernet cables running into a switch. Three months later, you need to unplug one. You unplug the wrong one — that was the Proxmox management interface for the node hosting your k3s control plane. Cascade failure.
Fix: Label every cable at both ends. Document which switch port goes to which device in your Git-tracked lab config. Take a photo of your setup after cabling changes. A $10 label maker prevents hours of troubleshooting.
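The Git-tracked port documentation can be as simple as a markdown table committed next to your playbooks; device names and labels below are purely illustrative:

```
| Switch port | Device        | Cable label |
|-------------|---------------|-------------|
| 1           | pve1 (mgmt)   | PVE1-MGMT   |
| 2           | pve2 (mgmt)   | PVE2-MGMT   |
| 3           | NAS           | NAS-LAN     |
```

When the table and the physical labels agree, "which cable is safe to pull?" stops being a guessing game.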
10. Treating your homelab like production (or not enough like production)
Two failure modes. Over-engineering: you spend months perfecting an HA Ceph cluster and never actually deploy an application. Under-engineering: you SSH into every node and make changes by hand, never learning the automation tools the lab was supposed to teach you.
Fix: Follow the 80/20 rule. Automate the things that teach you production skills (Ansible for config, Helm for deployments, Git for everything). Skip the things that only add complexity without learning value (triple-redundant storage for a lab you can rebuild in a day). The goal is learning, not uptime.