Portal | Level: L2: Operations | Topics: Edge & IoT Infrastructure, Linux Fundamentals | Domain: DevOps & Tooling
Edge & IoT Infrastructure - Primer
Why This Matters
Not all infrastructure lives in a data center or cloud region with reliable power, fast networks, and physical access. Edge computing pushes workloads to locations where the network is slow, the hardware is constrained, and you can't SSH in when things break. If you've managed fleet-scale Linux infrastructure, edge extends those skills into hostile environments: cell towers, retail stores, factory floors, oil rigs, and moving vehicles. The patterns are the same — configuration management, monitoring, updates, security — but the constraints force different solutions.
The edge is growing fast. Every industry that deploys hardware in the field needs ops people who can manage Linux devices they cannot physically touch, over networks that routinely drop packets.
Name origin: "Edge computing" gets its name from network topology diagrams, where end-user devices and remote locations sit at the "edge" of the network graph, far from the centralized core or cloud.
Core Concepts
1. The Edge Spectrum
Edge isn't a single thing — it's a spectrum of distance from your control plane:
                    Network Quality
                  ◄────────────────►
Data Center    Near Edge      Far Edge       Deep Edge
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ Full K8s │   │ k3s      │   │ Minimal  │   │ Bare     │
│ clusters │   │ MicroK8s │   │ Linux    │   │ RTOS     │
│ 100 Gbps │   │ 1 Gbps   │   │ LTE/4G   │   │ LoRa     │
│ Always on│   │ Usually  │   │ Intermit.│   │ Sporadic │
│ Physical │   │ Remote   │   │ No local │   │ Can't    │
│ access   │   │ hands    │   │ hands    │   │ reach    │
└──────────┘   └──────────┘   └──────────┘   └──────────┘
AWS region     Retail store   Cell tower     Agricultural
Colo/DC        Branch office  Vehicle        sensor
               Factory floor  Pipeline       Subsea cable
2. Lightweight Linux for Constrained Devices
A standard Ubuntu Server install occupies 2+ GB of disk. On an edge device with 4 GB of storage and 512 MB of RAM, you need a smaller OS:
Alpine Linux:
# ~5MB base image, musl libc, busybox userland
# Great for containers and lightweight VMs
# Package manager: apk
apk update
apk add docker openssh wireguard-tools
# WARNING: musl libc != glibc
# Binaries compiled against glibc generally won't run on Alpine
# Python wheels built for glibc (manylinux) won't install; pip needs a
# musllinux wheel or builds the C extension from source (needs a compiler)
Yocto Project (custom build system):
# Build a custom Linux image from scratch
# You choose: kernel config, packages, init system, everything
# Result: an OS image tailored to your exact hardware + use case
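# Basic workflow:
git clone git://git.yoctoproject.org/poky
cd poky
source oe-init-build-env # Sets up and enters the build/ directory
# Add layers for your hardware and software before building.
# Each layer must already be cloned; pass the path to its directory
# (the paths below assume the layers were cloned inside poky/):
bitbake-layers add-layer ../meta-raspberrypi
bitbake-layers add-layer ../meta-openembedded/meta-oe
bitbake core-image-minimal # ~8MB image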
Buildroot (simpler alternative to Yocto):
# Simpler than Yocto, generates complete embedded Linux images
make menuconfig # Select packages, kernel config, bootloader
make # Build the entire image
# Output: rootfs image, kernel, bootloader — ready to flash
| Distro | Image Size | Best For | Learning Curve |
|---|---|---|---|
| Alpine | ~5 MB | Containers, lightweight servers | Low |
| Yocto | Custom (8MB+) | Production embedded devices | High |
| Buildroot | Custom (5MB+) | Simpler embedded builds | Medium |
| Ubuntu Core | ~200 MB | Snap-based IoT devices | Low |
| Raspberry Pi OS Lite | ~400 MB | Pi-based projects | Low |
3. Remote Fleet Management
Managing devices you can't physically access requires different tools:
┌──────────────────────────────────────────────────┐
│ Control Plane (your data center / cloud)         │
│ ├── Fleet management server                      │
│ ├── OTA update server                            │
│ ├── Monitoring / telemetry ingestion             │
│ └── Configuration management (Ansible/Salt)      │
└────────────────────┬─────────────────────────────┘
                     │ (internet / cellular / satellite)
                     │ unreliable, high-latency, metered
                     │
    ┌────────────────┼────────────────────┐
    │                │                    │
┌───▼────┐      ┌────▼───┐         ┌──────▼───┐
│ Edge   │      │ Edge   │         │ Edge     │
│ Site A │      │ Site B │         │ Site C   │
│ (k3s)  │      │ (bare) │         │ (offline │
│        │      │        │         │  48hrs)  │
└────────┘      └────────┘         └──────────┘
SSH over unreliable connections:
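# autossh: maintains persistent SSH tunnels with auto-reconnect
# Run this ON THE EDGE DEVICE; it opens a reverse tunnel to the bastion.
autossh -M 0 -f -N \
-o "ServerAliveInterval 30" \
-o "ServerAliveCountMax 3" \
-o "ExitOnForwardFailure yes" \
-R 2222:localhost:22 user@bastion
# The forwarded port listens on the bastion's loopback interface by default,
# so reach the edge device from the bastion itself:
ssh -p 2222 user@localhost
# ...or in one hop from your workstation, jumping through the bastion:
ssh -J user@bastion -p 2222 user@localhost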
# Mosh: mobile shell — survives network changes, roaming, sleep
mosh user@edge-device # Uses UDP, handles high latency/packet loss
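autossh takes care of re-establishing the tunnel, but one-shot commands run over the same flaky link (file transfers, API calls) need their own retry logic. The function below is a minimal sketch of retry with capped exponential backoff. The name `retry_with_backoff` and the `SLEEP_CMD` hook are assumptions made for this example, not part of any tool mentioned above, and the sketch adds no jitter.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY MAX_DELAY CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached, doubling the
# wait between attempts (capped at MAX_DELAY seconds).
# SLEEP_CMD is a hypothetical hook so the wait can be stubbed out.
# Note: plain POSIX sh has no local variables, so these names are global.
retry_with_backoff() {
    max_attempts=$1; delay=$2; max_delay=$3
    shift 3
    attempt=1
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        ${SLEEP_CMD:-sleep} "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
        [ "$delay" -gt "$max_delay" ] && delay=$max_delay
    done
}
```

A call such as `retry_with_backoff 6 5 300 scp report.tgz user@bastion:/tmp/` (file name and destination are placeholders) would wait 5, 10, 20, 40, and 80 seconds between its six attempts. On a large fleet, adding random jitter to each delay helps avoid every device retrying at the same moment.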
Name origin: MQTT originally stood for "MQ Telemetry Transport" (the "MQ" from IBM's MQSeries message queue product). Andy Stanford-Clark (IBM) and Arlen Nipper invented it in 1999 to monitor oil pipelines over expensive satellite links. Since 2013, MQTT has officially been just a name, not an acronym.
MQTT for device telemetry:
# MQTT: lightweight pub/sub protocol designed for constrained networks
# Broker: mosquitto (runs on your control plane)
apt install mosquitto mosquitto-clients
# Edge device publishes metrics:
mosquitto_pub -h broker.example.com -t "edge/site-a/cpu" -m "45.2" -q 1
# Control plane subscribes:
mosquitto_sub -h broker.example.com -t "edge/+/cpu"
# QoS levels:
# 0 = fire and forget (may lose messages)
# 1 = at least once (may duplicate)
# 2 = exactly once (highest overhead)
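QoS 1 only protects a message once the client has a connection to hand it to; if the link is down when a reading is taken, the device has to hold on to it. The sketch below shows one way to do store-and-forward around `mosquitto_pub`: readings are appended to a local spool file and flushed in order when the broker is reachable. `SPOOL_FILE`, `PUBLISH_CMD`, and the function names are hypothetical, and the sketch assumes topics contain no spaces and ignores crash recovery of a half-flushed spool.

```shell
# SPOOL_FILE and PUBLISH_CMD are hypothetical names for this sketch.
SPOOL_FILE=${SPOOL_FILE:-/var/spool/edge-metrics.spool}

# Default publisher: one QoS 1 message per call via mosquitto_pub.
publish_mqtt() {   # publish_mqtt TOPIC PAYLOAD
    mosquitto_pub -h broker.example.com -t "$1" -m "$2" -q 1
}
PUBLISH_CMD=${PUBLISH_CMD:-publish_mqtt}

# Append a reading to the local spool; never touches the network.
spool_metric() {   # spool_metric TOPIC PAYLOAD
    printf '%s %s\n' "$1" "$2" >> "$SPOOL_FILE"
}

# Publish spooled lines in order. After the first failure, the remaining
# lines are written back to the spool untouched, so a dead link is only
# probed once per flush. Returns 0 if everything was sent, 1 otherwise.
flush_spool() {
    [ -s "$SPOOL_FILE" ] || return 0
    pending="$SPOOL_FILE.pending"
    mv "$SPOOL_FILE" "$pending"
    failed=0
    while read -r topic payload; do
        if [ "$failed" -eq 0 ] && $PUBLISH_CMD "$topic" "$payload" < /dev/null; then
            continue
        fi
        failed=1
        printf '%s %s\n' "$topic" "$payload" >> "$SPOOL_FILE"
    done < "$pending"
    rm -f "$pending"
    return "$failed"
}
```

A collector would call `spool_metric "edge/site-a/cpu" "45.2"` on every sample and run `flush_spool` from a timer. Because delivery is at-least-once, the consumer should tolerate duplicates.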
4. OTA (Over-The-Air) Update Strategies
Updating remote devices safely is one of the hardest problems in edge computing:
A/B Partition Scheme:
┌────────────────────────────────────────┐
│ Device Storage Layout │
│ │
│ ┌─────────┐ ┌─────────┐ ┌────────┐ │
│ │ Part A │ │ Part B │ │ Data │ │
│ │ (active)│ │ (update)│ │ (perst)│ │
│ │ v1.2.0 │ │ v1.3.0 │ │ │ │
│ └─────────┘ └─────────┘ └────────┘ │
│ │
│ Boot: A (current) → if fails → B │
│ Update: write to B → reboot → try B │
│ Rollback: if B fails 3x → back to A │
└────────────────────────────────────────┘
# SWUpdate: open-source OTA update framework for embedded Linux
# Edge device checks for updates and applies A/B swap:
swupdate -v -i update.swu -e "stable,upgrade_partition_b"
# RAUC (another A/B update framework):
rauc install update.raucb
rauc status # Show which partition is active
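The boot-and-rollback rule in the diagram above can be modelled as two small pure functions. This is only an illustration of the decision logic, with hypothetical function names. On real devices the counter usually lives in the bootloader environment (U-Boot, for example, has `bootcount` and `bootlimit` variables), and the update framework marks a slot good after a successful health check.

```shell
# choose_slot UPDATED_SLOT KNOWN_GOOD_SLOT FAILED_BOOTS MAX_FAILED_BOOTS
# Prints the slot the bootloader should try next.
choose_slot() {
    updated=$1; known_good=$2; failed_boots=$3; max_failed=$4
    if [ "$failed_boots" -ge "$max_failed" ]; then
        echo "$known_good"   # too many failed boots: roll back
    else
        echo "$updated"      # keep trying the freshly written slot
    fi
}

# after_boot_attempt FAILED_BOOTS HEALTHY
# HEALTHY is 1 if the health check passed, 0 otherwise.
# Prints the new failure count: reset on success, incremented on failure.
after_boot_attempt() {
    if [ "$2" -eq 1 ]; then
        echo 0
    else
        echo $(( $1 + 1 ))
    fi
}
```

With a limit of 3, `choose_slot B A 2 3` still prints `B`, while `choose_slot B A 3 3` prints `A`, matching the "if B fails 3x, back to A" rule.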
Update safety rules:
1. NEVER update all devices at once
- Canary: update 1% of fleet, wait, observe
- Phased: 1% → 5% → 25% → 100%
- Each phase: minimum 24-hour soak period
2. ALWAYS have automatic rollback
- Boot counter: if new version fails to boot 3 times, revert
- Health check: if new version fails health check, revert
- Watchdog timer: if system doesn't ping home within N minutes, revert
3. NEVER update the bootloader unless absolutely necessary
- A bad bootloader update = bricked device
- A bad OS update with A/B = recoverable
Remember: OTA rollout rule of thumb: "A-B-C -- Always Be Canary-ing." Never push firmware to 100% of your fleet at once. The A/B partition scheme gives you a safety net, but canary rollout percentages keep the blast radius small.
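One way to implement the 1% → 5% → 25% → 100% phases without keeping a per-device list is to hash each device ID into a stable bucket from 0 to 99 and admit every device whose bucket is below the current percentage. The sketch below uses POSIX `cksum` as the hash; the function names and the device ID format are assumptions for illustration.

```shell
# rollout_bucket DEVICE_ID
# Prints a stable number in 0..99 derived from the device ID.
rollout_bucket() {
    sum=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    echo $(( sum % 100 ))
}

# in_rollout DEVICE_ID PERCENT
# Succeeds (exit 0) if the device belongs to the given rollout phase.
in_rollout() {
    [ "$(rollout_bucket "$1")" -lt "$2" ]
}

# Example: a device decides whether the current phase includes it.
# if in_rollout "edge-site-001" 5; then echo "eligible for update"; fi
```

Because the bucket never changes, a device admitted at 5% is still admitted at 25%, so each phase is a superset of the previous one. Buckets are only roughly uniform, so the real cohort size will differ a little from the nominal percentage on small fleets.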
5. Edge Kubernetes (k3s, MicroK8s, KubeEdge)
┌──────────────────────────────────────────────────────────┐
│ k3s (Rancher) │
│ - Single binary, ~70MB │
│ - Full Kubernetes API compatibility │
│ - SQLite or etcd backend │
│ - ARM64 + AMD64 │
│ - Best for: edge sites with 1-10 nodes │
├──────────────────────────────────────────────────────────┤
│ MicroK8s (Canonical) │
│ - Snap-based, single node or clustered │
│ - Add-ons: istio, gpu, registry, dns │
│ - Best for: Ubuntu-based edge, developer workstations │
├──────────────────────────────────────────────────────────┤
│ KubeEdge (CNCF) │
│ - Cloud part (CloudCore) + Edge part (EdgeCore) │
│ - Works with intermittent connectivity │
│ - Edge nodes can operate independently when disconnected│
│ - Best for: thousands of edge devices managed centrally │
└──────────────────────────────────────────────────────────┘
# k3s on a Raspberry Pi:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable=traefik \
--write-kubeconfig-mode 644 \
--node-name edge-site-001" sh -
# KubeEdge: install EdgeCore on edge device
keadm join --cloudcore-ipport=control-plane:10000 \
--edgenode-name=edge-site-001 \
--kubeedge-version=1.15.0 \
--token=<join-token>
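6. Monitoring with Limited Bandwidth
Standard Prometheus scraping is a poor fit when your edge device sits on a 1 Mbps metered cellular connection: the central server has to reach every device to pull from it, and frequent full scrapes eat into the data plan. Three strategies that send less data, less often: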
Strategy 1: Edge-local Prometheus + remote write (aggregated)
┌────────────────┐ ┌──────────────────┐
│ Edge Device │ push │ Central Prom │
│ ┌──────────┐ │─────────────→│ (receives │
│ │ Prom │ │ (batched, │ aggregated │
│ │ (local) │ │ compressed)│ metrics) │
│ └──────────┘ │ └──────────────────┘
└────────────────┘
Strategy 2: Telegraf/Vector with batching
┌────────────────┐ ┌──────────────────┐
│ Edge Device │ batch │ Central TSDB │
│ ┌──────────┐ │─────────────→│ (InfluxDB / │
│ │ Telegraf │ │ every 5min │ Prometheus) │
│ │ + buffer │ │ or when │ │
│ └──────────┘ │ connected └──────────────────┘
└────────────────┘
Strategy 3: MQTT-based metrics
┌────────────────┐ ┌──────────────────┐
│ Edge Device │ MQTT │ Broker → │
│ ┌──────────┐ │─────────────→│ Telegraf → │
│ │ Custom │ │ QoS 1 │ Prometheus │
│ │ exporter │ │ (tiny │ │
│ └──────────┘ │ payload) └──────────────────┘
└────────────────┘
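All three strategies rely on shrinking the data before it leaves the device. As a rough illustration of the aggregation step, the sketch below collapses a window of raw samples into a single count/min/avg/max line with `awk`. The function name and the one-value-per-line input format are assumptions for this example.

```shell
# summarize_window: reads one numeric sample per line on stdin and
# prints "count min avg max" for the whole window (nothing if empty).
summarize_window() {
    awk '
        { v = $1 + 0 }
        NR == 1 { min = v; max = v }
        { sum += v; if (v < min) min = v; if (v > max) max = v }
        END { if (NR > 0) printf "%d %.1f %.1f %.1f\n", NR, min, sum / NR, max }
    '
}

# Example: three CPU samples become one line.
# printf '40\n50\n60\n' | summarize_window   # prints: 3 40.0 50.0 60.0
```

Five minutes of 10-second samples is 30 values; sending one summary line instead cuts the payload by roughly that factor, at the cost of losing the individual points.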
7. Security for Remote Devices
Devices you can't physically access are the hardest to secure:
# Mandatory security measures for edge devices:
# 1. Encrypted storage (LUKS)
cryptsetup luksFormat /dev/sda2
cryptsetup open /dev/sda2 data_crypt
# 2. Signed OTA updates (don't accept unsigned firmware)
# In your update server:
openssl dgst -sha256 -sign private.key -out update.sig update.swu
# On device: verify before applying
openssl dgst -sha256 -verify public.key -signature update.sig update.swu
# 3. Automatic certificate rotation
# Use short-lived certs (hours, not years)
# Device requests new cert via mTLS to CA
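# 4. Firewall: deny all, allow specific outbound
# Add the ACCEPT rules first and set the DROP policy last, so a remote
# session isn't cut off halfway through.
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -d <control-plane-ip> -p tcp --dport 443 -j ACCEPT
iptables -A OUTPUT -d <mqtt-broker-ip> -p tcp --dport 8883 -j ACCEPT
# Also allow DNS/NTP here if the device resolves names or syncs time
iptables -P OUTPUT DROP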
# 5. Read-only root filesystem
# Mount root as ro, use overlayfs for writable layers
# This makes it much harder for malware to persist across reboots
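The "verify before applying" step is only useful if the installer cannot run when verification fails. The wrapper below is a minimal sketch of that gate, built on the same `openssl dgst` call shown above. The function name is hypothetical. SWUpdate and RAUC both offer their own signed-image verification, which is usually the better choice in production.

```shell
# verify_then_install PUBLIC_KEY SIGNATURE IMAGE INSTALL_CMD [ARGS...]
# Runs INSTALL_CMD only if the detached signature over IMAGE verifies.
verify_then_install() {
    pubkey=$1; sig=$2; image=$3
    shift 3
    if openssl dgst -sha256 -verify "$pubkey" -signature "$sig" "$image" >/dev/null 2>&1; then
        "$@"
    else
        echo "signature check failed for $image; refusing to install" >&2
        return 1
    fi
}

# Example, using the file names from the commands above:
# verify_then_install public.key update.sig update.swu swupdate -v -i update.swu
```

The public key must come from a trusted location on the device (ideally the read-only root filesystem), never from the same download as the update.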
8. Cellular Failover
When wired internet is unreliable, cellular provides a backup path:
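# Using ModemManager + NetworkManager for cellular failover
nmcli connection add type gsm \
con-name "cellular-backup" \
ifname cdc-wdm0 \
apn "internet" \
ipv4.route-metric 700 # Higher metric = less preferred than wired
# Failover is driven by the default route's metric, not by
# connection.autoconnect-priority (which only ranks profiles competing
# for the same device).
# Both links stay connected; traffic uses the lowest-metric default route.
# When wired goes down, its route disappears and cellular takes over.
# When wired comes back, traffic switches back.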
# Monitor connectivity:
mmcli -m 0 # Modem status
mmcli -m 0 --signal-get # Signal strength
nmcli device status # Active connections
Common Pitfalls
- Assuming the network is reliable. It isn't. Design every protocol for intermittent connectivity. Buffer data locally. Retry with backoff. Your device should function (perhaps degraded) when completely offline.
- Not testing the rollback path. You tested the update. Did you test what happens when the update fails? Brick one test device intentionally to validate your A/B rollback mechanism.
- Using full Kubernetes on devices that don't need it. k3s on a 2GB Raspberry Pi works. Full Kubernetes on a 512MB industrial gateway doesn't. Not every edge device needs an orchestrator — sometimes a systemd unit file is the right answer.
- Ignoring power loss scenarios. Edge devices lose power unexpectedly. Use a journaling or copy-on-write filesystem (ext4, btrfs) and make boot robust against unclean shutdowns. Never write an update over the running partition; stage it on the inactive A/B slot.
- Metered data overruns. Your monitoring agent pushes 500MB/month of metrics over a $5/month cellular plan. Know your data budget. Aggregate, batch, and compress.
- Physical security of edge devices. Someone can walk up to your device and pull the SD card. Encrypt storage. Use secure boot. Assume physical access is possible.
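The data-budget pitfall above is easy to check with arithmetic before a device ships. The sketch below estimates payload bytes per 30-day month from message size and send interval. The function name is hypothetical, and the figure excludes TCP, TLS, and MQTT framing overhead, so treat it as a lower bound.

```shell
# monthly_bytes BYTES_PER_MESSAGE INTERVAL_SECONDS
# Payload bytes sent in a 30-day month at one message per interval.
# Uses integer division, so pick intervals that divide 2592000 evenly.
monthly_bytes() {
    echo $(( $1 * (30 * 24 * 3600 / $2) ))
}

# A 200-byte message every 10 seconds:
# monthly_bytes 200 10    # prints 51840000 (about 52 MB)
# The same message every 5 minutes:
# monthly_bytes 200 300   # prints 1728000 (about 1.7 MB)
```

Moving from a 10-second to a 5-minute interval cuts the monthly volume by a factor of 30, which is often the difference between fitting a small cellular plan and blowing through it.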
War story: In early IoT deployments, several companies shipped devices with default credentials and no OTA update mechanism. When vulnerabilities were found, the only fix was physically visiting thousands of devices. The Mirai botnet (2016) exploited exactly this pattern -- scanning for IoT devices with factory-default passwords and conscripting them into a DDoS army that took down major DNS provider Dyn.
Wiki Navigation
Prerequisites
- Linux Ops (Topic Pack, L0)
Related Content
- /proc Filesystem (Topic Pack, L2) — Linux Fundamentals
- Advanced Bash for Ops (Topic Pack, L1) — Linux Fundamentals
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Linux Fundamentals
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Linux Fundamentals
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Linux Fundamentals
- Case Study: Disk Full Root Services Down (Case Study, L1) — Linux Fundamentals
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Linux Fundamentals
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Linux Fundamentals
- Case Study: Inode Exhaustion (Case Study, L1) — Linux Fundamentals