Homelab & Learning Infrastructure - Street-Level Ops¶
What experienced homelab operators know that the subreddit posts don't tell you.
Quick Diagnosis Commands¶
# Proxmox cluster health
pvecm status # Cluster membership and quorum
pvecm expected 1 # Force quorum when nodes are down (DANGEROUS)
pvesr list # Configured replication jobs
zpool status # ZFS pool health
zpool list # ZFS space usage
# k3s health
systemctl status k3s # Server node service
systemctl status k3s-agent # Worker node service
kubectl get nodes -o wide # Node status with IPs
kubectl top nodes # Resource usage
crictl ps # Container runtime status
# Network diagnostics
ip -br addr # All interfaces, brief format
bridge vlan show # VLAN assignments on bridge ports
cat /etc/network/interfaces # Proxmox network config
wg show # WireGuard tunnel status
# DNS checks
dig @<pihole-ip> gitea.lab.home # Test local DNS resolution
pihole -c -e # PiHole stats (console)
pihole -t # Tail PiHole query log in real time
Gotcha: Proxmox Cluster Quorum Loss¶
You have a 3-node Proxmox cluster. One node goes down for maintenance. Then a second node loses power. The surviving node can't get quorum and refuses to start VMs.
Fix: On the surviving node, force expected votes:
pvecm expected 1
This is a temporary override that lets the surviving node operate standalone. When the other nodes come back, quorum restores automatically. Never leave this set permanently: it defeats the purpose of quorum protection.
Remember: Quorum requires >50% of expected votes. For a 3-node cluster, that means 2 nodes must be up. For a 2-node cluster, quorum is impossible with one failure, which is why 2-node clusters are worse than 1-node clusters for availability. Stick to odd node counts: 1, 3, 5.
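The majority rule is just arithmetic; a quick sketch (the `quorum_needed` helper is hypothetical, not a Proxmox command):

```shell
# Votes needed for quorum: a strict majority of total votes
quorum_needed() { echo $(( $1 / 2 + 1 )); }

quorum_needed 3   # 2 -> a 3-node cluster survives one failure
quorum_needed 2   # 2 -> a 2-node cluster survives none
quorum_needed 5   # 3 -> a 5-node cluster survives two failures
```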
Gotcha: k3s Embedded etcd Quorum¶
Same quorum problem, Kubernetes edition. You ran k3s with embedded etcd on 3 server nodes. Two go down. The remaining server is read-only — it can't schedule pods or update state.
Fix: If nodes are coming back soon, wait. If they're gone permanently:
# On the surviving server node, reset to a single-node etcd cluster
systemctl stop k3s
k3s server --cluster-reset
systemctl start k3s
# Then rejoin replacement server nodes fresh
Prevention: for a small lab, run a single server without --cluster-init (embedded SQLite, the k3s default) to avoid etcd quorum issues entirely. Add worker-only agents for capacity.
Gotcha: ZFS ARC Eating All Your RAM¶
You installed Proxmox on ZFS. Your VMs are slow and RAM looks fully consumed. On Linux, the ZFS ARC (Adaptive Replacement Cache) is allowed to grow to roughly half of system RAM by default, which is far too much on a VM host. It will release memory under pressure, but the OOM killer might get there first.
Fix: Limit ARC size in /etc/modprobe.d/zfs.conf:
# Set max ARC to 4GB (adjust for your total RAM)
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf
update-initramfs -u
reboot
# Verify
cat /proc/spl/kstat/zfs/arcstats | grep c_max
Default trap: the stock ARC ceiling (about half of RAM on Linux) is far too high for a virtualization host. The OOM killer may terminate your VMs before ARC releases memory, because the kernel sees ARC as "used," not "cache." Always set zfs_arc_max explicitly on any Proxmox host running VMs.
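The magic number in the config above is just GiB converted to bytes; a tiny hypothetical helper for sanity-checking your own value:

```shell
# GiB -> bytes, for picking a zfs_arc_max value
arc_max_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

arc_max_bytes 4   # 4294967296, the value used above
arc_max_bytes 8   # 8589934592
```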
Gotcha: Cloud-Init Template Not Updating¶
You cloned a cloud-init template, changed the IP configuration, but the VM still boots with the old IP. Cloud-init only runs on first boot by default.
Fix: Either regenerate the cloud-init drive from the host:
qm set <vmid> --ipconfig0 ip=10.10.20.50/24,gw=10.10.20.1
qm cloudinit dump <vmid> user # Verify the generated config
...or force a re-run from inside the guest: run cloud-init clean --logs in the VM, then reboot it.
Gotcha: k3s Traefik Conflicts with Your Own Ingress¶
k3s ships with Traefik as the default ingress controller. You installed nginx-ingress on top. Now you have two ingress controllers fighting over port 80/443.
Fix: Disable Traefik at k3s install time:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable traefik" sh -
If k3s is already running, add this to /etc/rancher/k3s/config.yaml:
disable:
  - traefik
Then restart the service and remove the existing Traefik deployment:
systemctl restart k3s
kubectl delete helmchart traefik -n kube-system
Pattern: Git-Driven Homelab Configuration¶
Store everything in a Git repo. This is your first GitOps practice:
homelab-config/
├── ansible/
│ ├── inventory/homelab.yml
│ ├── playbooks/
│ │ ├── base-config.yml
│ │ ├── proxmox-setup.yml
│ │ ├── k3s-deploy.yml
│ │ └── services.yml
│ └── roles/
├── helm-values/
│ ├── monitoring.yaml
│ ├── gitea.yaml
│ ├── argocd.yaml
│ └── pihole.yaml
├── k8s-manifests/
│ ├── namespaces.yaml
│ ├── ingress/
│ └── storage/
├── docs/
│ ├── network-diagram.md
│ ├── ip-assignments.md
│ └── hardware-inventory.md
└── scripts/
├── backup-proxmox.sh
├── rebuild-k3s.sh
└── restore-from-scratch.sh
The restore-from-scratch.sh script is the most important file. It should take a fresh Proxmox install to a fully working lab. If it doesn't, your documentation has gaps.
War story: An engineer's homelab NVMe died. Rebuild took 3 weekends because "I'll document it later" never happened. The second time, they scripted the entire rebuild. Third failure: 2 hours from bare metal to running services. The script is the documentation.
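A dry-run skeleton for restore-from-scratch.sh, assuming the repo layout above; it only prints the rebuild sequence, so swap echo for real execution once the playbooks exist:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rebuild order mirrors the playbooks in the repo layout above
for playbook in base-config proxmox-setup k3s-deploy services; do
  echo "ansible-playbook -i ansible/inventory/homelab.yml ansible/playbooks/${playbook}.yml"
done
```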
Pattern: Cheap HA for Critical Lab Services¶
PiHole goes down and your entire house loses DNS (including your spouse's Netflix). This is your first lesson in HA.
Primary PiHole (10.10.20.2) ←→ Secondary PiHole (10.10.20.3)
│ │
└── DHCP pushes both DNS ────────────┘
servers to all clients
# Sync configs with gravity-sync
# https://github.com/vmstan/gravity-sync
gravity-sync push # Primary → Secondary
gravity-sync pull # Secondary ← Primary
gravity-sync auto # Cron-based auto-sync
Set your router's DHCP to hand out both PiHole IPs as DNS servers. When one goes down, clients fail over to the other.
Gotcha: Most clients try the second DNS server only after a timeout (typically 5 seconds), not instantly. Expect a brief stall when the primary PiHole is down, not a seamless failover. Some devices (smart TVs, IoT) only use the first DNS server and never try the second.
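On Linux clients you can shrink that stall with glibc resolver options. This sketch just prints what /etc/resolv.conf should end up looking like once DHCP hands out both servers (the timeout values are illustrative):

```shell
# Expected client-side result: both PiHoles plus tighter failover timeouts
cat <<'EOF'
nameserver 10.10.20.2
nameserver 10.10.20.3
options timeout:2 attempts:2
EOF
```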
Pattern: Lab Network IP Addressing Plan¶
Plan your IPs before you start. Changing them later is painful:
10.10.10.0/24 — Management VLAN 10
.1 — Gateway (router)
.2-.10 — Proxmox hosts
.11-.20 — Switches, APs, OOB interfaces
10.10.20.0/24 — Servers VLAN 20
.1 — Gateway
.2-.3 — DNS (PiHole primary/secondary)
.10-.50 — Static VMs (Gitea, Grafana, etc.)
.100-.200 — DHCP range for new VMs
10.10.30.0/24 — IoT VLAN 30 (untrusted)
.1 — Gateway
.100-.254 — DHCP only
10.10.40.0/24 — Kubernetes VLAN 40
.1 — Gateway
.10-.12 — k3s server nodes
.20-.50 — k3s agent nodes
.200-.240 — MetalLB / ServiceLB pool
10.200.0.0/24 — WireGuard VPN
.1 — VPN server
.2-.10 — Client devices
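A throwaway helper (hypothetical, not a real tool) to sanity-check which VLAN an address belongs to under this plan:

```shell
# Map a lab IP to its VLAN per the addressing plan above
vlan_for_ip() {
  case "$1" in
    10.10.10.*) echo "Management (VLAN 10)" ;;
    10.10.20.*) echo "Servers (VLAN 20)" ;;
    10.10.30.*) echo "IoT (VLAN 30)" ;;
    10.10.40.*) echo "Kubernetes (VLAN 40)" ;;
    10.200.0.*) echo "WireGuard VPN" ;;
    *)          echo "not in the plan" ;;
  esac
}

vlan_for_ip 10.10.20.2    # Servers (VLAN 20)
vlan_for_ip 10.10.40.11   # Kubernetes (VLAN 40)
```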
Emergency: Lab Won't Boot After Power Outage¶
Power went out (or your UPS died). Everything is offline.
1. Check physical connections first — cables come loose
2. Boot order matters:
a. Network switch (wait for it to initialize VLANs)
b. NAS/storage (if VMs store data on NFS)
c. Proxmox hosts (one at a time, server node first)
d. Verify cluster quorum: pvecm status
e. Start critical VMs: PiHole, WireGuard
f. Start k3s server node, then agents
g. Verify: kubectl get nodes && kubectl get pods -A
3. If ZFS pool won't import:
zpool import -f <pool-name> # Force import after unclean shutdown
4. If k3s won't start:
journalctl -u k3s --since "5 minutes ago"
# Common: etcd took too long, increase timeout
# Or: --cluster-reset if etcd is corrupted
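The ordering in step 2 can be scripted. A sketch of a hypothetical wait_for_port helper using bash's /dev/tcp; the NAS IP and NFS port in the usage comment are assumptions:

```shell
# Block until a TCP port answers, so dependent VMs start only after their storage
wait_for_port() {
  local host=$1 port=$2 tries=${3:-30}
  local i
  for ((i = 0; i < tries; i++)); do
    timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null && return 0
    sleep 1
  done
  return 1
}

# Example: wait for the NAS's NFS port, then start the VM that mounts it
# wait_for_port 10.10.20.9 2049 && qm start 100
```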
Emergency: Ran Out of Disk Space on Proxmox¶
VMs won't start. LXC containers are read-only. ZFS is full.
1. Check what's eating space:
zpool list
zfs list -o name,used,avail,refer -s used
2. Delete old VM snapshots (biggest space saver):
qm listsnapshot <vmid>
qm delsnapshot <vmid> <snapshot-name>
3. Delete old ISO images:
ls -lh /var/lib/vz/template/iso/
rm /var/lib/vz/template/iso/old-distro.iso
4. Prune container images (if running Docker inside VMs):
docker system prune -a --volumes
5. Expand the ZFS pool (if you have empty drive bays):
# Add a new vdev. Caution: a single-drive vdev added to a mirrored pool
# becomes a single point of failure for the whole pool; match the existing
# vdev layout instead (e.g. zpool add <pool> mirror <dev1> <dev2>)
zpool add <pool> <device>
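To spot the snapshot hogs from step 2 quickly, an awk filter over `zfs list -t snapshot -o name,used` output works; sample output is hardcoded here so the sketch is self-contained:

```shell
# Flag snapshots measured in gigabytes (sample zfs list output, not live data)
sample='NAME USED
rpool/data/vm-100-disk-0@pre-upgrade 12.4G
rpool/data/vm-101-disk-0@daily 312M
rpool/data/vm-102-disk-0@pre-upgrade 8.1G'

echo "$sample" | awk 'NR > 1 && $2 ~ /G$/ { print $1, "holds", $2 }'
```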
Emergency: Locked Out of Proxmox Web UI¶
You changed the management IP, broke the network config, or the certificate expired.
1. Console access (keyboard + monitor, or IPMI/iDRAC):
# Fix network config
nano /etc/network/interfaces
systemctl restart networking
2. Reset web UI password:
pveum passwd root@pam
3. If SSL cert is broken:
pvecm updatecerts --force
systemctl restart pveproxy
4. If all else fails — Proxmox stores VM configs in:
/etc/pve/qemu-server/<vmid>.conf # VM configs
/etc/pve/lxc/<ctid>.conf # Container configs
# These are on a cluster filesystem (pmxcfs)
# Your VMs and data are safe even if the UI is broken
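Before experimenting on a broken node, it is worth tarring those config directories. A hypothetical helper, with paths parameterized so it can be pointed at a copy for testing:

```shell
# Archive VM and CT configs (defaults to the pmxcfs paths listed above)
backup_pve_configs() {
  local src=${1:-/etc/pve} dest=${2:-/root}
  tar czf "${dest}/pve-configs-$(date +%F).tar.gz" -C "$src" qemu-server lxc
}

# Usage on a real node:
# backup_pve_configs   # -> /root/pve-configs-YYYY-MM-DD.tar.gz
```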