
Edge & IoT Infrastructure Footguns

Mistakes that brick remote devices, blow your cellular data budget, or leave field hardware permanently unreachable.


1. Pushing an update without testing the rollback path

You tested the update on a bench device. It worked. You pushed it to 200 field devices. The update has a subtle bug that only manifests after 6 hours. The automatic rollback timer was set to 10 minutes. The devices "passed" the rollback window, committed the bad update, and are now stuck with broken firmware and no way back.

Fix: Test the rollback independently of the update. Intentionally brick a bench device and verify it rolls back. Set your rollback health check to validate actual application behavior, not just "did it boot." Extend the soak period to at least 24 hours before marking an update as permanent.

Gotcha: A/B partition schemes (used by SWUpdate, RAUC, Mender) only protect the OS image. If the update also migrates application data or config files, a rollback restores the old OS but leaves the new (possibly incompatible) data format. Always make data migrations backward-compatible.
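The soak-and-commit logic above can be sketched in Python. The health endpoint, port, and `updater` CLI here are placeholders for whatever your actual updater (SWUpdate, RAUC, Mender each have their own commit/rollback commands) exposes:

```python
"""Post-update soak sketch: mark the update permanent only after the
application has behaved correctly for a full soak period. The /health
endpoint and the `updater` CLI are hypothetical stand-ins."""
import subprocess
import time
import urllib.request

SOAK_SECONDS = 24 * 3600   # soak for 24h before committing the update
CHECK_INTERVAL = 300       # probe actual application behavior every 5 minutes

def app_healthy() -> bool:
    """Check real application behavior, not just 'did it boot'."""
    try:
        with urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=10) as r:
            return r.status == 200
    except OSError:
        return False

def soak_and_commit() -> None:
    deadline = time.monotonic() + SOAK_SECONDS
    while time.monotonic() < deadline:
        if not app_healthy():
            # hypothetical rollback command; substitute your updater's own
            subprocess.run(["updater", "rollback"], check=True)
            return
        time.sleep(CHECK_INTERVAL)
    # only now does the update become permanent
    subprocess.run(["updater", "commit"], check=True)
```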


2. Using SD cards for production edge devices

Raspberry Pi with an SD card running your production edge workload. After 12-18 months, flash wear makes the SD card start remounting read-only. The device becomes unresponsive, and you lose any data that hadn't yet synced to the control plane.

Fix: Boot from USB SSD on Pi 4+ (faster, more durable). For industrial edge, use eMMC or industrial-grade SD cards rated for high write endurance. Mount volatile paths (logs, tmp) as tmpfs to reduce flash writes. Monitor write cycles and replace proactively.
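One way to cut flash writes, assuming a Debian-style system: mount the chattiest paths as tmpfs in /etc/fstab. Sizes here are illustrative; note that tmpfs contents vanish on reboot, so forward any logs you need to keep.

```
# /etc/fstab -- keep write-heavy paths in RAM to spare the flash
tmpfs  /tmp      tmpfs  defaults,noatime,size=64m  0  0
tmpfs  /var/tmp  tmpfs  defaults,noatime,size=32m  0  0
tmpfs  /var/log  tmpfs  defaults,noatime,size=64m  0  0
```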


3. Hardcoding the control plane IP instead of DNS

Your 500 edge devices connect to the control plane at 203.0.113.50. You need to migrate the control plane to a new server. You can't update the IP on devices that are only reachable through the control plane — the one you're trying to change the IP of.

Fix: Always use DNS names for the control plane. If you must change the server, update the DNS record. Devices will resolve the new IP. If devices use a local DNS cache, set a low TTL on the control plane record.
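A minimal sketch of the resolve-with-fallback idea in Python; the hostname and cache path are made up for illustration. The device resolves the control plane by name every time, and remembers the last address that worked in case DNS itself is briefly unavailable:

```python
"""Resolve the control plane by DNS name, caching the last good address.
Hostname and cache location are hypothetical examples."""
import json
import socket
from pathlib import Path

DEFAULT_CACHE = Path("/var/lib/edge-agent/last-control-plane.json")

def control_plane_addr(hostname: str = "control.example.com",
                       port: int = 443,
                       cache: Path = DEFAULT_CACHE) -> str:
    try:
        # the system resolver honors the DNS record's TTL, so a low TTL
        # on the control plane record means fast failover
        addr = socket.getaddrinfo(hostname, port,
                                  proto=socket.IPPROTO_TCP)[0][4][0]
        cache.parent.mkdir(parents=True, exist_ok=True)
        cache.write_text(json.dumps({"addr": addr}))
        return addr
    except OSError:
        # DNS unavailable: fall back to the last address that resolved
        return json.loads(cache.read_text())["addr"]
```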


4. No local data buffer on the device

Your edge device streams sensor data directly to the cloud over cellular. The network drops for 2 hours. Two hours of data is lost forever. There was no local buffer.

Fix: Always buffer locally first. Write to a local SQLite database, a file-based queue, or systemd journal. Upload when connected. Mark as uploaded only after the server acknowledges receipt. Size the buffer for your worst-case connectivity gap (72 hours is a reasonable target).
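A store-and-forward buffer along these lines can be sketched with stdlib SQLite; `upload_batch` stands in for your real transport and should return True only when the server acknowledges receipt:

```python
"""Store-and-forward sketch: readings land in SQLite first and are
deleted only after the server acknowledges the batch."""
import json
import sqlite3
import time

def open_buffer(path: str = "/var/lib/edge-agent/buffer.db") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS readings "
               "(id INTEGER PRIMARY KEY, ts REAL, payload TEXT)")
    return db

def record(db: sqlite3.Connection, payload: dict) -> None:
    # write locally first, regardless of connectivity
    db.execute("INSERT INTO readings (ts, payload) VALUES (?, ?)",
               (time.time(), json.dumps(payload)))
    db.commit()

def flush(db: sqlite3.Connection, upload_batch) -> int:
    """Upload pending rows; delete them only after the server acks."""
    rows = db.execute("SELECT id, payload FROM readings "
                      "ORDER BY id LIMIT 500").fetchall()
    if rows and upload_batch([json.loads(p) for _, p in rows]):
        db.executemany("DELETE FROM readings WHERE id = ?",
                       [(i,) for i, _ in rows])
        db.commit()
        return len(rows)
    return 0
```

Calling `flush` on a timer whenever the link is up gives you the upload loop; an unacknowledged batch simply stays in the database for the next attempt.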


5. Cellular SIM without a data cap or monitoring

Your fleet of 100 devices has unlimited data SIMs. A bug in your logging causes one device to upload 50GB in a month instead of 500MB. You don't notice until the carrier throttles the entire account or sends a $10,000 bill.

Fix: Set up per-SIM data caps with your carrier. Monitor data usage per device. Alert when a device exceeds 2x its expected monthly usage. Rate-limit the cellular interface with tc or iptables. Use vnstat to track per-interface data consumption on each device.
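The 2x-expected alert rule can be sketched in Python against the kernel's per-interface byte counters in sysfs; the interface name and monthly budget are assumptions for illustration:

```python
"""Data-usage alarm sketch: read the cellular interface's byte counters
and flag anything over 2x the expected monthly budget. Interface name
and budget are illustrative."""
from pathlib import Path

EXPECTED_MONTHLY_BYTES = 500 * 1024**2   # hypothetical 500 MB plan

def interface_bytes(iface: str = "wwan0") -> int:
    """Total bytes moved over the interface since boot (Linux sysfs)."""
    stats = Path(f"/sys/class/net/{iface}/statistics")
    return (int((stats / "rx_bytes").read_text())
            + int((stats / "tx_bytes").read_text()))

def over_budget(used_bytes: int, expected: int = EXPECTED_MONTHLY_BYTES) -> bool:
    """Alert at 2x expected usage, per the rule of thumb above."""
    return used_bytes > 2 * expected
```

Note that sysfs counters reset on reboot, so for month-over-month accounting you still want something like vnstat, which persists totals across restarts.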


6. Running full Kubernetes on a 1GB RAM device

You love Kubernetes, so you install k3s on a device with 1GB of RAM. k3s takes 512MB. Your application needs 300MB. CoreDNS needs 70MB. The system needs 200MB. You're over budget. The OOM killer starts picking off pods at random. The device becomes unreliable.

Fix: k3s minimum is 2GB RAM for a comfortable single-node setup. For devices under 2GB, use plain Docker with docker-compose, or just systemd services. Not every edge device needs an orchestrator. A well-written systemd unit file with restart policies is perfectly valid for single-device deployments.
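For the no-orchestrator route, a systemd unit wrapping a single container might look like this; the service name, image, and paths are hypothetical:

```
# /etc/systemd/system/edge-app.service -- single-container deployment
[Unit]
Description=Edge application
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service

[Service]
ExecStart=/usr/bin/docker run --rm --name edge-app edge-app:latest
ExecStop=/usr/bin/docker stop edge-app
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=always` with a short `RestartSec` gives you the crash-recovery behavior an orchestrator would, at a fraction of the memory cost.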


7. No watchdog timer configured

Your edge device hangs due to a kernel panic, a deadlocked application, or a hardware glitch. It stays hung forever. Nobody knows because the device isn't reporting metrics (because it's hung). The failure is only discovered when someone physically visits the site weeks later.

Fix: Enable the hardware watchdog timer. Configure systemd's RuntimeWatchdogSec. If the system stops responding, the hardware watchdog reboots it within 30-60 seconds. For application-level hangs, use WatchdogSec in your systemd service file so systemd restarts the process.

Under the hood: The hardware watchdog (/dev/watchdog) must be "petted" (written to) at regular intervals. If the kernel hangs and can't pet the watchdog, the hardware triggers a hard reset. Set RuntimeWatchdogSec=20 in /etc/systemd/system.conf and systemd will pet the hardware watchdog every 10 seconds (half the timeout).
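For the application-level case, the notify protocol that `WatchdogSec=` relies on is small enough to sketch without dependencies: the service periodically sends `WATCHDOG=1` over the datagram socket systemd passes in the `NOTIFY_SOCKET` environment variable.

```python
"""Minimal sd_notify sketch: send watchdog keepalives to systemd over
the NOTIFY_SOCKET datagram socket, with no external dependencies."""
import os
import socket

def sd_notify(state: str = "WATCHDOG=1") -> bool:
    """Send a notify message; returns False if not running under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # abstract-namespace socket: leading '@' maps to a NUL byte
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.sendto(state.encode(), path)
    return True
```

Pair this with `Type=notify` and `WatchdogSec=30` in the unit, send `READY=1` once at startup, and call `sd_notify()` from the main loop at no more than half the timeout.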


8. Shared SSH keys across the entire fleet

All 500 edge devices use the same SSH key pair for reverse tunnel access. One device is physically compromised and the key is extracted. The attacker now has SSH access to every device in your fleet via the shared bastion.

Fix: Use per-device unique credentials. Generate a unique SSH key pair for each device during provisioning. Even better: use SSH certificates with short expiry issued by your CA. Rotate keys regularly. If one device is compromised, only one key needs to be revoked.


9. Updating all devices simultaneously

You pushed an OTA update to all 500 devices at once. The update server buckles under 500 simultaneous 200MB downloads. Half the downloads fail mid-transfer. The devices that got partial downloads are now in an inconsistent state. Some rolled back, some are stuck.

Fix: Stagger updates across the fleet. Update in waves: 1% first, wait 24 hours, then 5%, then 25%, then the rest. Use a CDN or edge cache for update distribution. Implement resume-capable downloads with checksum verification so a partial download can be completed later.
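Wave assignment can be made deterministic by hashing the device ID, so a device never hops between waves across runs. The wave fractions follow the schedule above; the hashing scheme is one reasonable choice, not a standard:

```python
"""Deterministic wave assignment sketch: hash each device ID into a
stable rollout wave (1%, 5%, 25%, 100% cumulative)."""
import hashlib

WAVES = [0.01, 0.05, 0.25, 1.00]   # cumulative fleet fraction per wave

def rollout_wave(device_id: str) -> int:
    """Return the 0-based wave in which this device gets the update."""
    # SHA-256 gives a stable, evenly distributed bucket per device ID
    digest = hashlib.sha256(device_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    for wave, cutoff in enumerate(WAVES):
        if fraction < cutoff:
            return wave
    return len(WAVES) - 1
```

The update server then only offers the new image to devices whose wave is currently open, advancing one wave per soak period.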


10. Not encrypting data at rest on physically accessible devices

Your edge device is in a retail store, a vehicle, or a field installation. Someone pulls the disk/SD card. All your application data, API keys, certificates, and customer data are readable in plaintext. Congratulations, that's a data breach.

Fix: Encrypt the data partition with LUKS. Store the encryption key in a TPM if available, or derive it from a unique device secret provisioned during manufacturing. The root filesystem can be read-only and unencrypted (it's just the OS), but all application data and credentials must be encrypted at rest.
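When no TPM is available, deriving the passphrase from a provisioned device secret can be sketched like this; the secret path, salt, and iteration count are illustrative, not recommendations:

```python
"""Key-derivation sketch: stretch a per-device secret provisioned at
manufacturing into a LUKS passphrase, so a pulled disk is useless
without the secret. All paths and parameters are illustrative."""
import hashlib
from pathlib import Path

def luks_passphrase(secret_path: str = "/sys/firmware/devicetree/base/serial-number",
                    salt: bytes = b"edge-luks-v1") -> str:
    secret = Path(secret_path).read_bytes()
    # PBKDF2 stretches the short device secret into a strong passphrase
    key = hashlib.pbkdf2_hmac("sha256", secret, salt, 600_000)
    return key.hex()
```

The result can be piped to `cryptsetup open --key-file=-` at boot; since the secret never leaves the device and differs per unit, one stolen disk reveals nothing about the rest of the fleet.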

War story: In 2019, a fleet of unencrypted IoT gateways deployed in retail stores was physically stolen during a break-in. The SD cards contained WiFi credentials, VPN certificates, and API tokens. The attacker used the VPN certificates to access the company's internal network. LUKS encryption would have made the stolen data useless.