Edge & IoT Infrastructure - Street-Level Ops¶
What experienced edge operators know about managing devices you can't walk up to and reboot.
Quick Diagnosis Commands¶
# Check cellular modem status
mmcli -m 0 # Modem state and signal
mmcli -m 0 --signal-setup=10 # Enable signal polling (refresh every 10s)
mmcli -m 0 --signal-get # Signal quality (RSSI, RSRP)
mmcli -m 0 --location-get # GPS position (if modem supports)
# Check connectivity to control plane
ping -c 3 -W 5 control-plane.example.com # Basic reachability
curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
https://control-plane.example.com/health
# Check WireGuard tunnel
wg show # Tunnel status, last handshake
# If "latest handshake" is >2 minutes, the tunnel is likely dead
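That staleness check can be automated. A minimal sketch, assuming the interface is `wg0`, a 2-minute threshold, and a `wg-quick@` systemd unit:

```shell
#!/bin/sh
# Restart WireGuard when the newest handshake is stale.
# IFACE and MAX_AGE are assumptions; adjust for your setup.
IFACE="wg0"
MAX_AGE=120

# Returns 0 if any peer's last handshake is fresher than MAX_AGE seconds.
handshake_fresh() {
  now=$(date +%s)
  # `wg show <iface> latest-handshakes` prints: <peer-pubkey>\t<epoch-seconds>
  latest=$(wg show "$IFACE" latest-handshakes | awk '{print $2}' | sort -n | tail -1)
  [ -n "$latest" ] && [ $((now - latest)) -lt "$MAX_AGE" ]
}

# Run from cron or a systemd timer every minute:
#   handshake_fresh || systemctl restart "wg-quick@$IFACE"
```

Restarting `wg-quick@` is heavy-handed but reliable on cellular links where the peer endpoint IP may have changed.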
# Check disk health on flash storage
smartctl -a /dev/sda # SMART data (if supported)
cat /sys/fs/ext4/sda1/lifetime_write_kbytes # Total writes to flash
# Check OTA update status
rauc status # A/B partition status
swupdate-progress # SWUpdate progress client (if using SWUpdate)
cat /etc/os-release # Running version
# Check system resource usage on constrained device
free -m # Memory (MB matters here)
df -h # Disk usage
uptime # Load average
top -bn1 | head -20 # Top processes
journalctl --disk-usage # Journal consuming disk?
Gotcha: Flash Storage Wear-Out¶
Your edge device runs on an SD card or eMMC. After 18 months, the device starts throwing I/O errors and becomes read-only. The flash storage wore out from constant writes — logs, metrics buffers, temporary files.
Fix: Minimize writes to flash:
# Mount /tmp and /var/log as tmpfs (RAM-backed)
echo "tmpfs /tmp tmpfs defaults,noatime,nosuid,size=64m 0 0" >> /etc/fstab
echo "tmpfs /var/log tmpfs defaults,noatime,nosuid,size=32m 0 0" >> /etc/fstab
# Reduce journald writes
# /etc/systemd/journald.conf
Storage=volatile # Write to RAM only
RuntimeMaxUse=16M # Cap journal size in RAM
# Disable filesystem access time updates
# In /etc/fstab, add noatime to all flash partitions
/dev/sda1 / ext4 defaults,noatime 0 1
# Use F2FS instead of ext4 for flash-optimized filesystem
mkfs.f2fs /dev/sda2
Also: track lifetime_write_kbytes and alert before it hits the manufacturer's rated write endurance.
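That counter can feed a simple endurance alarm. A sketch, assuming a 64 GB card rated for roughly 1000 P/E cycles (hypothetical figures; take the real TBW number from your storage vendor's datasheet):

```shell
#!/bin/sh
# Alert before cumulative flash writes reach rated endurance.
# COUNTER path and RATED_KB are assumptions for this sketch.
COUNTER="/sys/fs/ext4/sda1/lifetime_write_kbytes"
RATED_KB=$((64 * 1024 * 1024 * 1000))   # ~64 TBW expressed in KB

wear_pct() {  # wear_pct <written_kb> <rated_kb>
  echo $(( $1 * 100 / $2 ))
}

written_kb=$(cat "$COUNTER" 2>/dev/null || echo 0)
pct=$(wear_pct "$written_kb" "$RATED_KB")
if [ "$pct" -ge 80 ]; then
  echo "WARN: flash at ${pct}% of rated write endurance"
fi
```

Ship the percentage as a metric so you can plan replacements fleet-wide instead of reacting device by device.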
Gotcha: Reverse SSH Tunnel Drops Silently¶
Your edge device maintains a reverse SSH tunnel to a bastion for remote access. The tunnel drops because the cellular connection hiccupped. autossh can't re-establish the forward because the bastion hasn't noticed the dead connection yet, so the stale socket still holds the forwarded port on the server side.
Fix: Configure aggressive keepalives on both sides:
# On the edge device (autossh command):
autossh -M 0 -f -N -R 2222:localhost:22 user@bastion \
-o "ServerAliveInterval 15" \
-o "ServerAliveCountMax 3" \
-o "ExitOnForwardFailure yes" \
-o "TCPKeepAlive yes" \
-o "ConnectTimeout 10"
# On the bastion (/etc/ssh/sshd_config):
ClientAliveInterval 15
ClientAliveCountMax 3
# Also: use a systemd service with restart policy
# /etc/systemd/system/ssh-tunnel.service
[Unit]
Description=Reverse SSH tunnel to bastion
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=/usr/bin/autossh -M 0 -N -R 2222:localhost:22 user@bastion \
-o "ServerAliveInterval 15" -o "ServerAliveCountMax 3"
Restart=always
RestartSec=30
[Install]
WantedBy=multi-user.target
Gotcha: OTA Update Bricks the Device¶
War story: A fleet of 2,000 weather stations received an OTA update that changed the NTP server address. The new NTP server was behind a firewall that blocked the devices' source IPs. Devices booted the new image, failed to sync time, TLS certificate validation failed (clock was wrong), and the control plane became unreachable. Recovery required physical site visits.
You pushed a firmware update to 500 edge devices. The update has a bug in the network configuration. Now 500 devices are online but unreachable — they booted the new image successfully (so no automatic rollback triggered), but the network config is wrong.
Fix: Implement a phone-home health check as part of the update validation:
# In the post-update boot script:
MAX_RETRIES=3
HEALTH_URL="https://control-plane.example.com/device/health"
for i in $(seq 1 "$MAX_RETRIES"); do
  if curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL" | grep -q '^200$'; then
    # Mark this partition as good
    rauc status mark-good
    exit 0
  fi
  sleep 60
done
# If we get here, health check failed — rollback
rauc status mark-bad
reboot # Boot loader will switch to previous partition
Gotcha: Time Drift on Devices Without NTP¶
Edge devices with no reliable internet lose time sync. Certificates fail validation because the device clock thinks it's 1970. TLS connections to the control plane fail. The device can't pull updates because every HTTPS connection errors out.
Fix: Multiple time sync strategies:
# Primary: NTP when connected
timedatectl set-ntp true
# Fallback: GPS time (if modem has GPS)
# Parse NMEA sentences for time
gpspipe -r | grep -m1 GPRMC | cut -d',' -f2
# Fallback: save time to disk before shutdown, restore on boot
# /etc/systemd/system/fake-hwclock.service
# Writes current time to /etc/fake-hwclock.data on shutdown
# Reads it on boot — at least you're in the right year
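The save/restore fallback can be sketched in a few lines of shell (the stamp path and timestamp format are assumptions; Debian's fake-hwclock package does the same thing properly):

```shell
#!/bin/sh
# Minimal fake-hwclock: persist the clock across reboots so the device
# at least boots in the right era instead of 1970.
STAMP="/etc/fake-hwclock.data"

save_time() {
  # Call on shutdown, plus from cron every ~15 minutes as a crash hedge
  date -u '+%Y-%m-%d %H:%M:%S' > "$STAMP"
}

restore_time() {
  # Call on boot; the clock resumes from the last save, slightly behind
  # real time, until NTP or GPS corrects it
  [ -f "$STAMP" ] && date -u -s "$(cat "$STAMP")"
}
```

Wire save_time into a shutdown-time systemd unit and restore_time into an early-boot one.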
# Fallback: set time from HTTP headers on first connection
date -s "$(curl -sI https://control-plane.example.com | \
grep -i '^date:' | cut -d' ' -f2- | tr -d '\r')"
Gotcha: Cellular Data Budget Blown by Uncompressed Logs¶
Scale note: Gzip compression on JSON logs typically achieves 10:1 reduction. A device sending 10 MB/day drops to 1 MB/day compressed. Across 500 devices, that is the difference between 150 GB/month and 15 GB/month of cellular data — often the difference between a viable project and a killed project.
Your monitoring agent sends raw JSON logs over LTE. A chatty application generates 10MB/day of logs. With 500 devices, that's 5GB/day of cellular data. Your monthly bill spikes from $2,000 to $15,000.
Fix: Compress, batch, and budget:
# Telegraf/Vector: batch and compress before sending
# vector.toml
[sinks.central]
type = "http"
inputs = ["app_logs"] # Illustrative source name
uri = "https://control-plane.example.com/ingest" # Required by the http sink
compression = "gzip"
batch.max_bytes = 1048576 # 1MB batches
batch.timeout_secs = 300 # Or every 5 minutes
encoding.codec = "json"
# Set a hard data budget per device
# tc rate limiting on the cellular interface
tc qdisc add dev wwan0 root tbf rate 50kbit burst 10kb latency 50ms
# Monitor data usage per interface
cat /sys/class/net/wwan0/statistics/tx_bytes
cat /sys/class/net/wwan0/statistics/rx_bytes
# Or: vnstat -i wwan0 -m # Monthly data tracking
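Those counters can back a hard budget alarm. A sketch, assuming a 1 GB/month budget on wwan0 (kernel counters reset on reboot, so a real implementation persists a baseline — vnstat already does this):

```shell
#!/bin/sh
# Compare current cellular usage against a monthly budget.
# IFACE and BUDGET_BYTES are assumptions for this sketch.
IFACE="wwan0"
BUDGET_BYTES=$((1024 * 1024 * 1024))   # 1 GB/month

used_bytes() {  # used_bytes <tx_bytes> <rx_bytes>
  echo $(( $1 + $2 ))
}

over_budget() {  # over_budget <used> <budget>; exit 0 if over
  [ "$1" -gt "$2" ]
}

tx=$(cat "/sys/class/net/$IFACE/statistics/tx_bytes" 2>/dev/null || echo 0)
rx=$(cat "/sys/class/net/$IFACE/statistics/rx_bytes" 2>/dev/null || echo 0)
if over_budget "$(used_bytes "$tx" "$rx")" "$BUDGET_BYTES"; then
  echo "ALERT: $IFACE over monthly data budget"
fi
```

Pair the alert with the tc rate limit above: the limit caps the damage, the alert tells you which device is chatty.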
Pattern: Edge Device Provisioning Pipeline¶
┌─────────────────────────────────────────────────────┐
│ Factory / Bench Setup │
│ 1. Flash base image (A/B partitions) │
│ 2. Inject device identity (UUID, certs, keys) │
│ 3. Configure initial network (DHCP + VPN) │
│ 4. First boot: device phones home to control plane │
│ 5. Control plane enrolls device in fleet management│
│ 6. Device pulls its configuration (Ansible/Salt) │
│ 7. Device starts services │
│ 8. Monitoring confirms healthy │
└─────────────────────────────────────────────────────┘
Ongoing Management:
├── OTA updates via A/B partition swap
├── Configuration changes via MQTT command channel
├── Telemetry via MQTT + local buffering
├── Remote access via reverse SSH / WireGuard
└── Alerting via control plane monitoring
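Step 4 of the pipeline, the first-boot phone-home, can be sketched as below. The enrollment URL, certificate paths, and JSON shape are all assumptions; substitute your control plane's actual API:

```shell
#!/bin/sh
# First-boot enrollment sketch: identify ourselves and retry until the
# control plane acknowledges. Paths and endpoint are illustrative.
DEVICE_ID=$(cat /etc/device-id 2>/dev/null || echo "unknown")
ENROLL_URL="https://control-plane.example.com/enroll"

payload() {  # payload <device-id> <os-version>
  printf '{"device_id":"%s","os_release":"%s"}' "$1" "$2"
}

if [ -r /etc/os-release ]; then
  os_version=$(. /etc/os-release; echo "$VERSION_ID")
else
  os_version="unknown"
fi

enroll() {
  # Run from a oneshot systemd unit on first boot; retries are cheap
  until curl -s -f --cert /etc/device/client.crt --key /etc/device/client.key \
        -H 'Content-Type: application/json' \
        -d "$(payload "$DEVICE_ID" "$os_version")" "$ENROLL_URL"; do
    sleep 60
  done
}
```

The mutual-TLS client cert injected at the factory (step 2) is what makes the enrollment trustworthy.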
Pattern: Store-and-Forward for Unreliable Networks¶
When the network is intermittent, queue everything locally:
# Use SQLite as a local buffer for metrics/events
# On the edge device:
# 1. Application writes data to local SQLite
# 2. Uploader service reads from SQLite, sends to server
# 3. On successful send, mark records as uploaded
# 4. If send fails, retry on next connectivity window
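Steps 1-4 can be sketched with the sqlite3 CLI. The table schema, endpoint, and 100-row batch size are assumptions:

```shell
#!/bin/sh
# SQLite store-and-forward buffer: writers insert rows, an uploader
# drains them when connectivity allows. Requires the sqlite3 CLI.
DB="/var/lib/app/buffer.db"
URL="https://control-plane.example.com/ingest"   # Illustrative endpoint

init_db() {
  sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS events
                 (id INTEGER PRIMARY KEY, body TEXT, uploaded INTEGER DEFAULT 0);'
}

pending_count() {
  sqlite3 "$DB" 'SELECT COUNT(*) FROM events WHERE uploaded = 0;'
}

upload_batch() {
  batch=$(sqlite3 "$DB" 'SELECT body FROM events WHERE uploaded = 0 LIMIT 100;')
  if [ -z "$batch" ]; then return 0; fi
  if printf '%s\n' "$batch" | curl -s -f --data-binary @- "$URL"; then
    # Mark only on confirmed delivery; unsent rows stay pending
    sqlite3 "$DB" 'UPDATE events SET uploaded = 1
                   WHERE id IN (SELECT id FROM events WHERE uploaded = 0 LIMIT 100);'
  fi
}

# Run upload_batch from a timer; retries cost nothing because state lives in the DB.
```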
# Alternative: use systemd-journal as the buffer
# journald stores structured events locally
# A forwarder (Vector, Fluent Bit) ships them when connected
# Fluent Bit config for store-and-forward:
[SERVICE]
    storage.path    /var/fluent-bit/buffer    # Required for filesystem buffering

[INPUT]
    Name            tail
    Path            /var/log/app/*.log
    storage.type    filesystem    # Buffer to disk when network is down

[OUTPUT]
    Name            http
    Match           *
    Host            control-plane.example.com
    Port            443
    TLS             On
    Retry_Limit     100    # Keep retrying
    storage.total_limit_size    50M    # Cap local buffer
Pattern: Watchdog Timer for Unrecoverable Hangs¶
Default trap: Most SBCs (Raspberry Pi, BeagleBone) have hardware watchdog support disabled by default. The kernel module (bcm2835_wdt on the Pi) must be loaded, and RuntimeWatchdogSec must be set in systemd. Without this, a kernel panic or deadlock leaves the device unrecoverable until someone physically power-cycles it.
Edge devices that hang need to recover automatically — nobody is going to reboot them:
# Hardware watchdog: if the OS doesn't pet the watchdog,
# the hardware reboots the device
# Start the hardware watchdog: once running, something must keep
# writing to it, or the hardware resets the board
echo 1 > /dev/watchdog
# Systemd manages the watchdog automatically:
# /etc/systemd/system.conf
RuntimeWatchdogSec=30 # Reboot if system hangs for 30s
ShutdownWatchdogSec=10min # Force reboot if shutdown hangs
# Application-level watchdog:
# Your app must write to the watchdog periodically
# If it hangs, the device reboots
# systemd service with WatchdogSec:
[Service]
Type=notify
Restart=on-watchdog # Needed for the auto-restart to actually happen
WatchdogSec=60 # App must call sd_notify("WATCHDOG=1")
# within 60s or systemd restarts it
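From the application side, petting can be done with `systemd-notify`, the CLI wrapper for sd_notify(). A sketch (`do_work` is a hypothetical unit of work; a shell-script ExecStart may also need `NotifyAccess=all` in the unit):

```shell
#!/bin/sh
# Watchdog-petting main loop for a Type=notify service with WatchdogSec=60.

pet_interval_for() {  # pet at ~1/3 of WatchdogSec for safety margin
  echo $(( $1 / 3 ))
}

main_loop() {
  systemd-notify --ready                # Startup complete
  while true; do
    do_work                             # Must finish well under WatchdogSec
    systemd-notify WATCHDOG=1           # Pet the watchdog
    sleep "$(pet_interval_for 60)"
  done
}

# ExecStart would run a script that calls main_loop.
```

If do_work can legitimately take longer than the watchdog window, pet from a separate thread or split the work into smaller steps.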
Emergency: Mass Fleet Bricking from Bad Update¶
You pushed an update. Devices are boot-looping. The A/B rollback is working but the old partition has an expired certificate that prevents it from talking to the control plane. Devices are alive but unreachable.
1. Stop the rollout immediately
- Halt the OTA server from sending updates
- Protect whatever share of the fleet has NOT yet updated
2. For devices stuck on the bad version:
- If reverse SSH tunnel still works: push a fix via SSH
- If WireGuard tunnel still works: push a fix via VPN
- If neither works: do they have a fallback check-in method?
3. Recovery options (worst to best):
a. Physical visit (send a tech with a USB stick)
b. SMS command channel (if cellular modem supports AT commands)
c. MQTT retained message (device picks up fix on next connection)
d. DNS-based command (device resolves a special hostname for instructions)
4. Prevention:
- Always maintain a "recovery" partition or boot mode
- Canary updates: 1 device, then 10, then 100, then all
- 24-hour soak between phases
- Certificate expiry must be checked before update AND after rollback
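Option (d) above, the DNS command channel, can be sketched as follows. The hostname, the `cmd=<action>` record format, and the dig dependency are assumptions; the point is that DNS often still resolves when your control plane is unreachable:

```shell
#!/bin/sh
# Poll a TXT record for out-of-band recovery instructions.
RECOVERY_HOST="recovery.example.com"   # Illustrative hostname

parse_command() {  # parse_command <txt-payload>; prints action or nothing
  case "$1" in
    cmd=*) echo "${1#cmd=}" ;;
    *)     echo "" ;;
  esac
}

fetch_command() {
  parse_command "$(dig +short TXT "$RECOVERY_HOST" | tr -d '"')"
}

# e.g. publishing the record "cmd=rollback" makes devices boot the other
# partition on their next poll:
#   [ "$(fetch_command)" = "rollback" ] && rauc status mark-bad && reboot
```

Sign or at least version the record in a real deployment; an unauthenticated reboot channel is itself an attack surface.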
Emergency: Edge Device Compromised¶
An edge device in the field is behaving suspiciously — unexpected outbound connections, unusual CPU usage, or the control plane received tampered data.
1. Isolate: revoke the device's certificates on the control plane
- Remove it from the fleet management system
- Block its VPN keys
- It can no longer authenticate to your infrastructure
2. If you can still access it:
# Capture evidence
netstat -antp # All TCP connections (listening and established)
ps auxf # Process tree
find / -mtime -1 -type f # Recently modified files
cat /var/log/auth.log # Login history
# DO NOT reboot yet — volatile evidence will be lost
3. If you can't access it:
- It's a write-off for forensics
- Revoke all credentials it had access to
- Check what data it could have exfiltrated
4. Remediation:
- Flash from a known-good image (don't try to "clean" it)
- Rotate any shared secrets the device knew
- Review: how did the compromise happen?
- Was it a software vulnerability? Physical tampering?