

Edge & IoT Infrastructure - Primer

Why This Matters

Not all infrastructure lives in a data center or cloud region with reliable power, fast networks, and physical access. Edge computing pushes workloads to locations where the network is slow, the hardware is constrained, and you can't SSH in when things break. If you've managed fleet-scale Linux infrastructure, edge extends those skills into hostile environments: cell towers, retail stores, factory floors, oil rigs, and moving vehicles. The patterns are the same — configuration management, monitoring, updates, security — but the constraints force different solutions.

The edge is growing fast. Every industry that deploys hardware in the field needs ops people who understand how to manage Linux devices they can't physically touch, over networks that drop packets like a sieve.

Name origin: "Edge computing" gets its name from network topology diagrams, where end-user devices and remote locations sit at the "edge" of the network graph, far from the centralized core or cloud.

Core Concepts

1. The Edge Spectrum

Edge isn't a single thing — it's a spectrum of distance from your control plane:

                    Network Quality
                    ◄────────────────►
  Data Center    Near Edge     Far Edge     Deep Edge
  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │ Full K8s │  │ k3s      │  │ Minimal  │  │ Bare     │
  │ clusters │  │ MicroK8s │  │ Linux    │  │ RTOS     │
  │ 100 Gbps │  │ 1 Gbps   │  │ LTE/4G   │  │ LoRa     │
  │ Always on│  │ Usually  │  │ Intermit.│  │ Sporadic │
  │ Physical │  │ Remote   │  │ No local │  │ Can't    │
  │ access   │  │ hands    │  │ hands    │  │ reach    │
  └──────────┘  └──────────┘  └──────────┘  └──────────┘
  AWS region     Retail store  Cell tower    Agricultural
  Colo/DC        Branch office Vehicle       sensor
                 Factory floor Pipeline      Subsea cable

2. Lightweight Linux for Constrained Devices

Standard Ubuntu Server is 2+ GB. On an edge device with 4 GB of storage and 512 MB of RAM, you need a smaller OS:

Alpine Linux:

# ~5MB base image, musl libc, busybox userland
# Great for containers and lightweight VMs
# Package manager: apk
apk update
apk add docker openssh wireguard-tools

# WARNING: musl libc != glibc
# Some binaries compiled for glibc won't work on Alpine
# Python packages with C extensions may need --no-binary
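
The glibc/musl mismatch is easy to check for before shipping a binary. A minimal sketch (assumes `ldd` is on the PATH, as it is on both glibc and musl systems):

```shell
#!/bin/sh
# Report which libc this host uses. Binaries dynamically linked against
# glibc generally will not run on a musl host unless statically linked.
if ldd --version 2>&1 | grep -qi musl; then
  echo "libc: musl"
else
  echo "libc: glibc (or another non-musl libc)"
fi
```

On Alpine, `ldd --version` identifies itself as musl, so the first branch fires; on Ubuntu/Debian it reports GNU libc.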

Yocto Project (custom build system):

# Build a custom Linux image from scratch
# You choose: kernel config, packages, init system, everything
# Result: an OS image tailored to your exact hardware + use case

# Basic workflow:
git clone git://git.yoctoproject.org/poky
source oe-init-build-env
bitbake core-image-minimal     # ~8MB image

# Add layers for your hardware and software:
bitbake-layers add-layer meta-raspberrypi
bitbake-layers add-layer meta-openembedded

Buildroot (simpler alternative to Yocto):

# Simpler than Yocto, generates complete embedded Linux images
make menuconfig    # Select packages, kernel config, bootloader
make               # Build the entire image
# Output: rootfs image, kernel, bootloader — ready to flash

Distro                 Image Size      Best For                         Learning Curve
Alpine                 ~5 MB           Containers, lightweight servers  Low
Yocto                  Custom (8MB+)   Production embedded devices      High
Buildroot              Custom (5MB+)   Simpler embedded builds          Medium
Ubuntu Core            ~200 MB         Snap-based IoT devices           Low
Raspberry Pi OS Lite   ~400 MB         Pi-based projects                Low

3. Remote Fleet Management

Managing devices you can't physically access requires different tools:

┌──────────────────────────────────────────────────┐
│  Control Plane (your data center / cloud)        │
│  ├── Fleet management server                     │
│  ├── OTA update server                           │
│  ├── Monitoring / telemetry ingestion            │
│  └── Configuration management (Ansible/Salt)     │
└────────────────────┬─────────────────────────────┘
                     │  (internet / cellular / satellite)
                     │  unreliable, high-latency, metered
    ┌────────────────┼────────────────────┐
    │                │                    │
┌───▼────┐     ┌────▼───┐         ┌──────▼───┐
│ Edge   │     │ Edge   │         │ Edge     │
│ Site A │     │ Site B │         │ Site C   │
│ (k3s)  │     │ (bare) │         │ (offline │
│        │     │        │         │  48hrs)  │
└────────┘     └────────┘         └──────────┘

SSH over unreliable connections:

# autossh: maintains persistent SSH tunnels with auto-reconnect
autossh -M 0 -f -N -R 2222:localhost:22 user@bastion \
  -o "ServerAliveInterval 30" \
  -o "ServerAliveCountMax 3" \
  -o "ExitOnForwardFailure yes"

# Now SSH to the edge device via the bastion:
ssh -p 2222 user@bastion

# Mosh: mobile shell — survives network changes, roaming, sleep
mosh user@edge-device    # Uses UDP, handles high latency/packet loss
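
In practice you want mosh by default with ssh as the fallback. A small wrapper sketch (the `edge_shell` helper name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: prefer mosh on flaky links, fall back to plain ssh
# when mosh isn't installed or the mosh session can't be established.
edge_shell() {
  host=$1
  if command -v mosh >/dev/null 2>&1; then
    mosh "$host" || ssh "$host"
  else
    ssh "$host"
  fi
}
# Usage: edge_shell user@edge-device
```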

Name origin: MQTT originally stood for "MQ Telemetry Transport" (the "MQ" from IBM's MQSeries message queue product). Andy Stanford-Clark (IBM) and Arlen Nipper invented it in 1999 to monitor oil pipelines over expensive satellite links. Since 2013, MQTT has officially been just a name, not an acronym.

MQTT for device telemetry:

# MQTT: lightweight pub/sub protocol designed for constrained networks
# Broker: mosquitto (runs on your control plane)
apt install mosquitto mosquitto-clients

# Edge device publishes metrics:
mosquitto_pub -h broker.example.com -t "edge/site-a/cpu" -m "45.2" -q 1

# Control plane subscribes:
mosquitto_sub -h broker.example.com -t "edge/+/cpu"
# QoS levels:
#   0 = fire and forget (may lose messages)
#   1 = at least once (may duplicate)
#   2 = exactly once (highest overhead)
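
Because the link drops, even QoS 1 publishes should be wrapped in retry logic on the device side. A sketch reusing the `mosquitto_pub` invocation above (the broker hostname is a placeholder):

```shell
#!/bin/sh
# Retry a QoS 1 publish with exponential backoff (1s, 2s, 4s, 8s, 16s)
# instead of losing the sample when the cellular link flaps.
publish_with_backoff() {
  topic=$1; payload=$2; delay=1
  for attempt in 1 2 3 4 5; do
    if mosquitto_pub -h broker.example.com -t "$topic" -m "$payload" -q 1; then
      return 0
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
  done
  return 1    # caller should spool the sample to disk for later
}
# Usage: publish_with_backoff "edge/site-a/cpu" "45.2"
```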

4. OTA (Over-The-Air) Update Strategies

Updating remote devices safely is one of the hardest problems in edge computing:

A/B Partition Scheme:

┌────────────────────────────────────────┐
│  Device Storage Layout                 │
│                                         │
│  ┌─────────┐  ┌─────────┐  ┌────────┐ │
│  │ Part A  │  │ Part B  │  │ Data   │ │
│  │ (active)│  │ (update)│  │ (perst)│ │
│  │ v1.2.0  │  │ v1.3.0  │  │        │ │
│  └─────────┘  └─────────┘  └────────┘ │
│                                         │
│  Boot: A (current) → if fails → B      │
│  Update: write to B → reboot → try B   │
│  Rollback: if B fails 3x → back to A   │
└────────────────────────────────────────┘

# SWUpdate: open-source OTA update framework for embedded Linux
# Edge device checks for updates and applies A/B swap:
swupdate -v -i update.swu -e "stable,upgrade_partition_b"

# RAUC (another A/B update framework):
rauc install update.raucb
rauc status    # Show which partition is active
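
The rollback logic in the diagram needs a confirmation hook on the device: after booting the new slot, a health check decides whether the update "sticks". A sketch, assuming U-Boot's `fw_setenv` (u-boot-tools), a bootloader that honours a `bootcount` variable, and a hypothetical local `/healthz` endpoint; RAUC users would call `rauc status mark-good` instead:

```shell
#!/bin/sh
# Post-boot confirmation hook (run once the application is up). If the new
# slot is healthy, reset the boot counter so the bootloader keeps it; if
# not, leave the counter alone so repeated failures fall back to slot A.
confirm_update() {
  if curl -fsS --max-time 5 http://localhost:8080/healthz >/dev/null; then
    fw_setenv bootcount 0        # assumes bootloader checks "bootcount"
    echo "update confirmed"
  else
    echo "health check failed; not confirming"
    return 1
  fi
}
```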

Update safety rules:

1. NEVER update all devices at once
   - Canary: update 1% of fleet, wait, observe
   - Phased: 1% → 5% → 25% → 100%
   - Each phase: minimum 24-hour soak period

2. ALWAYS have automatic rollback
   - Boot counter: if new version fails to boot 3 times, revert
   - Health check: if new version fails health check, revert
   - Watchdog timer: if system doesn't ping home within N minutes, revert

3. NEVER update the bootloader unless absolutely necessary
   - A bad bootloader update = bricked device
   - A bad OS update with A/B = recoverable
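
The phased percentages translate directly into batch sizes when planning a rollout. A trivial sketch:

```shell
#!/bin/sh
# Print how many devices each rollout phase touches, cumulatively.
rollout_plan() {
  fleet=$1
  for pct in 1 5 25 100; do
    echo "phase ${pct}%: $(( fleet * pct / 100 )) devices"
  done
}
rollout_plan 10000
# phase 1%: 100 devices
# phase 5%: 500 devices
# phase 25%: 2500 devices
# phase 100%: 10000 devices
```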

Remember: OTA rollback rule of thumb: "A-B-C -- Always Be Canary-ing." Never push firmware to 100% of your fleet at once. The A/B partition scheme gives you a safety net, but canary rollout percentages keep the blast radius small.

5. Edge Kubernetes (k3s, MicroK8s, KubeEdge)

┌──────────────────────────────────────────────────────────┐
│  k3s (Rancher)                                           │
│  - Single binary, ~70MB                                  │
│  - Full Kubernetes API compatibility                     │
│  - SQLite or etcd backend                                │
│  - ARM64 + AMD64                                         │
│  - Best for: edge sites with 1-10 nodes                  │
├──────────────────────────────────────────────────────────┤
│  MicroK8s (Canonical)                                    │
│  - Snap-based, single node or clustered                  │
│  - Add-ons: istio, gpu, registry, dns                    │
│  - Best for: Ubuntu-based edge, developer workstations   │
├──────────────────────────────────────────────────────────┤
│  KubeEdge (CNCF)                                         │
│  - Cloud part (CloudCore) + Edge part (EdgeCore)         │
│  - Works with intermittent connectivity                  │
│  - Edge nodes can operate independently when disconnected│
│  - Best for: thousands of edge devices managed centrally │
└──────────────────────────────────────────────────────────┘

# k3s on a Raspberry Pi:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable=traefik \
  --write-kubeconfig-mode 644 \
  --node-name edge-site-001" sh -

# KubeEdge: install EdgeCore on edge device
keadm join --cloudcore-ipport=control-plane:10000 \
  --edgenode-name=edge-site-001 \
  --kubeedge-version=1.15.0 \
  --token=<join-token>

6. Monitoring with Limited Bandwidth

Standard Prometheus scraping doesn't work when your edge device is on a 1 Mbps cellular connection with metered data:

Strategy 1: Edge-local Prometheus + remote write (aggregated)
┌────────────────┐              ┌──────────────────┐
│  Edge Device   │   push       │  Central Prom    │
│  ┌──────────┐  │─────────────→│  (receives       │
│  │ Prom     │  │  (batched,   │   aggregated     │
│  │ (local)  │  │   compressed)│   metrics)       │
│  └──────────┘  │              └──────────────────┘
└────────────────┘

Strategy 2: Telegraf/Vector with batching
┌────────────────┐              ┌──────────────────┐
│  Edge Device   │   batch      │  Central TSDB    │
│  ┌──────────┐  │─────────────→│  (InfluxDB /     │
│  │ Telegraf │  │  every 5min  │   Prometheus)    │
│  │ + buffer │  │  or when     │                  │
│  └──────────┘  │  connected   └──────────────────┘
└────────────────┘

Strategy 3: MQTT-based metrics
┌────────────────┐              ┌──────────────────┐
│  Edge Device   │   MQTT       │  Broker →        │
│  ┌──────────┐  │─────────────→│  Telegraf →      │
│  │ Custom   │  │  QoS 1       │  Prometheus      │
│  │ exporter │  │  (tiny       │                  │
│  └──────────┘  │   payload)   └──────────────────┘
└────────────────┘
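
Strategies 2 and 3 both reduce to "write locally, flush opportunistically". A bare-bones store-and-forward sketch (the spool path, endpoint, and reachability probe are placeholders; Telegraf and Vector ship disk buffers that do this properly):

```shell
#!/bin/sh
# Store-and-forward: every sample lands on disk first; a flush only runs
# when the uplink answers, and the spool is truncated only after success.
SPOOL=${SPOOL:-/var/spool/metrics.log}

record() {                 # record <metric> <value>
  echo "$(date +%s) $1 $2" >> "$SPOOL"
}

flush() {
  if ping -c1 -W2 metrics.example.com >/dev/null 2>&1; then
    gzip -c "$SPOOL" | curl -fsS --data-binary @- \
      https://metrics.example.com/ingest && : > "$SPOOL"
  fi
}
```

Truncating only after a successful upload means a mid-flush disconnect costs a duplicate batch, never lost data.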

7. Security for Remote Devices

Devices you can't physically access are the hardest to secure:

# Mandatory security measures for edge devices:
# 1. Encrypted storage (LUKS)
cryptsetup luksFormat /dev/sda2
cryptsetup open /dev/sda2 data_crypt

# 2. Signed OTA updates (don't accept unsigned firmware)
# In your update server:
openssl dgst -sha256 -sign private.key -out update.sig update.swu
# On device: verify before applying
openssl dgst -sha256 -verify public.key -signature update.sig update.swu

# 3. Automatic certificate rotation
# Use short-lived certs (hours, not years)
# Device requests new cert via mTLS to CA

# 4. Firewall: deny all, allow specific outbound
iptables -P OUTPUT DROP
iptables -A OUTPUT -d <control-plane-ip> -p tcp --dport 443 -j ACCEPT
iptables -A OUTPUT -d <mqtt-broker-ip> -p tcp --dport 8883 -j ACCEPT
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# 5. Read-only root filesystem
# Mount root as ro, use overlayfs for writable layers
# This prevents persistent malware
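
The sign/verify pair above can be exercised end-to-end with throwaway keys (a sketch; a real pipeline keeps private.key in the build system and ships only public.key on devices):

```shell
#!/bin/sh
# Generate a keypair, sign a fake firmware file, then verify it -- the same
# openssl invocations the device would run before applying an update.
set -e
workdir=$(mktemp -d)
echo "fake firmware payload" > "$workdir/update.swu"

openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 \
  -out "$workdir/private.key" 2>/dev/null
openssl pkey -in "$workdir/private.key" -pubout -out "$workdir/public.key"

openssl dgst -sha256 -sign "$workdir/private.key" \
  -out "$workdir/update.sig" "$workdir/update.swu"
openssl dgst -sha256 -verify "$workdir/public.key" \
  -signature "$workdir/update.sig" "$workdir/update.swu"   # prints "Verified OK"
```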

8. Cellular Failover

When wired internet is unreliable, cellular provides a backup path:

# Using ModemManager + NetworkManager for cellular failover
nmcli connection add type gsm \
  con-name "cellular-backup" \
  ifname cdc-wdm0 \
  apn "internet" \
  connection.autoconnect-priority -10    # Lower than wired

# Automatic failover: NetworkManager handles this
# When wired goes down, cellular activates
# When wired comes back, traffic switches back

# Monitor connectivity:
mmcli -m 0              # Modem status
mmcli -m 0 --signal-get # Signal strength
nmcli device status      # Active connections
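
Failover can still wedge (e.g. a crashed modem leaves no usable route), so many deployments add a watchdog run from a cron job or systemd timer. A hypothetical sketch using the profile name from above:

```shell
#!/bin/sh
# Uplink watchdog: if a probe through the current default route fails,
# try to bring up the cellular profile. Run periodically; do not loop here.
watchdog() {
  if ping -c1 -W3 1.1.1.1 >/dev/null 2>&1; then
    echo "uplink ok"
  else
    echo "uplink down; activating cellular-backup"
    nmcli connection up cellular-backup
  fi
}
# e.g. from cron: */2 * * * * /usr/local/bin/uplink-watchdog
```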

Common Pitfalls

  • Assuming the network is reliable. It isn't. Design every protocol for intermittent connectivity. Buffer data locally. Retry with backoff. Your device should function (perhaps degraded) when completely offline.
  • Not testing the rollback path. You tested the update. Did you test what happens when the update fails? Brick one test device intentionally to validate your A/B rollback mechanism.
  • Using full Kubernetes on devices that don't need it. k3s on a 2GB Raspberry Pi works. Full Kubernetes on a 512MB industrial gateway doesn't. Not every edge device needs an orchestrator — sometimes a systemd unit file is the right answer.
  • Ignoring power loss scenarios. Edge devices lose power unexpectedly. Use journaling filesystems (ext4, btrfs). Make boot robust against unclean shutdowns. Never update a partition without A/B swap.
  • Metered data overruns. Your monitoring agent pushes 500MB/month of metrics over a $5/month cellular plan. Know your data budget. Aggregate, batch, and compress.
  • Physical security of edge devices. Someone can walk up to your device and pull the SD card. Encrypt storage. Use secure boot. Assume physical access is possible.

War story: In early IoT deployments, several companies shipped devices with default credentials and no OTA update mechanism. When vulnerabilities were found, the only fix was physically visiting thousands of devices. The Mirai botnet (2016) exploited exactly this pattern -- scanning for IoT devices with factory-default passwords and conscripting them into a DDoS army that took down major DNS provider Dyn.

