
iptables & nftables - Street-Level Ops

Real-world firewall diagnosis, emergency recovery, and production management workflows.

Task: Emergency Lockout Recovery

You flushed the iptables rules over SSH and locked yourself out. The INPUT chain policy is DROP, and with the rules gone, nothing is allowed in.

# If you still have console access (IPMI, KVM, cloud serial console):
$ iptables -P INPUT ACCEPT
$ iptables -P FORWARD ACCEPT
$ iptables -P OUTPUT ACCEPT
$ iptables -F
$ iptables -X
# You're back. Now rebuild your rules properly.

# If you have a cron job or at job scheduled:
# (This is why you set one BEFORE making changes)
$ cat /var/spool/cron/crontabs/root
*/5 * * * * /sbin/iptables-restore < /etc/iptables/rules.v4
# Wait 5 minutes. Rules are restored. Remove the cron job.

# AWS/GCP/Azure: modify the security group from the console
# to allow SSH, then fix the instance's iptables via SSH.

# Last resort: stop the instance, attach the disk to another instance,
# edit /etc/iptables/rules.v4, detach, reattach, start.

Prevention: Always Set a Safety Net Before Changes

# Before touching rules on a remote machine:
$ iptables-save > /tmp/rules.backup
$ echo "/sbin/iptables-restore < /tmp/rules.backup" | at now + 5 minutes
# If you lock yourself out, wait 5 minutes.
# If everything works, cancel: atrm <job_number>
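The same pattern generalizes to a pair of small shell functions. This is a sketch with illustrative names; in practice the undo command would be `iptables-restore < /tmp/rules.backup`:

```shell
#!/bin/sh
# Generic revert-timer sketch. Schedule an undo command that fires
# after N seconds unless you cancel it first.
schedule_revert() {   # $1 = undo command, $2 = seconds; prints timer PID
    # Redirect output so command substitution doesn't wait on the timer.
    ( sleep "$2"; eval "$1" ) >/dev/null 2>&1 &
    echo $!
}
cancel_revert() {     # $1 = timer PID; call once you've confirmed access
    kill "$1" 2>/dev/null
}

# Usage with iptables (run as root):
#   iptables-save > /tmp/rules.backup
#   pid=$(schedule_revert 'iptables-restore < /tmp/rules.backup' 300)
#   ...make changes, verify SSH still works...
#   cancel_revert "$pid"
```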

Task: Debugging Dropped Packets

Something is being blocked but you don't know which rule is doing it.

# Step 1: Check packet counters on all rules
$ iptables -L -n -v
# The first two columns are packet and byte counters.
# Look for rules with high counters that you didn't expect.

Chain INPUT (policy DROP 1423 packets, 87654 bytes)
 pkts bytes target     prot opt in     out     source       destination
 245K  198M ACCEPT     all  --  *      *       0.0.0.0/0    0.0.0.0/0    ctstate RELATED,ESTABLISHED
    0     0 ACCEPT     tcp  --  *      *       0.0.0.0/0    0.0.0.0/0    tcp dpt:22
#                                                                         ^^^^ 0 packets to SSH? Something is wrong.

# Step 2: Add a LOG rule to see what's being dropped.
# Appending with -A puts it last in the chain, so it logs only packets
# that matched no earlier rule -- exactly the ones the DROP policy discards.
$ iptables -A INPUT -j LOG --log-prefix "IPT-DEBUG: " --log-level 4

# Watch the log
$ journalctl -kf | grep "IPT-DEBUG"
# Or: dmesg -w | grep "IPT-DEBUG"
# Output shows source IP, dest IP, protocol, port for every dropped packet
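Each LOG line is a single kernel message with KEY=value fields (IN=, SRC=, DST=, PROTO=, DPT=). A small pipeline can summarize who is being dropped most often; a sketch (the function name is illustrative) that reads files or stdin:

```shell
# Top source IPs seen in IPT-DEBUG log lines.
top_blocked_sources() {
    grep -o 'SRC=[0-9.]*' "$@" | sort | uniq -c | sort -rn | head
}

# Usage:
#   journalctl -k | grep "IPT-DEBUG" | top_blocked_sources
```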

# Step 3: Check if it's a conntrack issue
$ conntrack -L | grep <suspicious_ip>
$ cat /proc/sys/net/netfilter/nf_conntrack_count
$ cat /proc/sys/net/netfilter/nf_conntrack_max
# If count == max, the table is full. New connections are dropped.

# Step 4: Check all tables, not just filter
$ iptables -t nat -L -n -v
$ iptables -t mangle -L -n -v
$ iptables -t raw -L -n -v

# Step 5: Check nftables too (rules might be in the nft backend)
$ nft list ruleset

# Step 6: Reset counters to isolate new traffic
$ iptables -Z    # Zero all counters
# Reproduce the problem, then check counters again
$ iptables -L -n -v

Task: Docker and iptables Interaction

Gotcha: Docker's -p 8080:80 bypasses your INPUT chain entirely. The traffic flows through PREROUTING (DNAT) and FORWARD, never touching INPUT. This means your carefully crafted INPUT rules do not protect Docker-published ports. Use the DOCKER-USER chain (below) for host-level filtering of container traffic.

Docker manages its own iptables chains. Understanding them prevents conflicts.

# See Docker's chains
$ iptables -L -n -v | grep -A5 DOCKER
$ iptables -t nat -L -n -v | grep -A5 DOCKER

# Docker creates these chains:
# DOCKER          — per-container DNAT rules for published ports
# DOCKER-ISOLATION-STAGE-1/2 — inter-network isolation
# DOCKER-USER     — YOUR rules go here (evaluated first)

# Problem: Docker published port bypasses your INPUT rules
# because it uses FORWARD + NAT, not INPUT.

# Solution: Use DOCKER-USER for host-level filtering of Docker traffic.
# -I inserts at position 1, so insert the DROP first; the ACCEPT inserted
# second lands above it, giving the intended allow-then-drop order.
$ iptables -I DOCKER-USER -i eth0 -p tcp --dport 8080 -j DROP
$ iptables -I DOCKER-USER -i eth0 -s 10.0.0.0/8 -p tcp --dport 8080 -j ACCEPT

# Verify Docker-published port access
$ iptables -L DOCKER-USER -n -v

# Docker rebuilds its own chains when the daemon restarts, but it leaves
# DOCKER-USER alone -- rules you put there survive restarts.

# To see exactly what Docker configured:
$ iptables-save | grep -i docker

Task: Kubernetes and iptables (kube-proxy)

kube-proxy manages thousands of iptables rules for Service routing. Don't hand-edit these.

# See kube-proxy's rules
$ iptables-save | grep -c KUBE
# Output: 2847 (thousands of rules is normal for a large cluster)

# Trace a specific Service's rules
$ iptables-save | grep KUBE-SVC- | head -20
# Each KUBE-SVC-* chain handles one Kubernetes Service
# Each KUBE-SEP-* chain handles one endpoint (pod IP)

# Inspect a specific service
$ iptables -t nat -L KUBE-SVC-XXXXX -n -v
# Shows the probability-based load balancing rules to pod endpoints
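Those probabilities look strange in isolation (0.333..., then 0.5, then an unconditional final rule), but each rule fires with probability 1/(remaining endpoints), so every endpoint gets an equal 1/N overall share. A quick sketch of the arithmetic (kube_probs is an illustrative helper, not a kube-proxy tool):

```shell
# Print the per-rule probabilities kube-proxy uses for N endpoints.
# Rule i (0-based) fires with probability 1/(N-i); the last rule is
# effectively unconditional (probability 1).
kube_probs() {
    awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) printf "%.5f\n", 1/(n-i) }'
}

# Usage:
#   kube_probs 3    # three endpoints -> 0.33333, 0.50000, 1.00000
```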

# If kube-proxy rules look stale (pods are gone but rules remain):
$ kubectl -n kube-system rollout restart daemonset kube-proxy
# kube-proxy rebuilds its rules from the API server

# To check if iptables mode is even being used:
$ kubectl -n kube-system get configmap kube-proxy -o yaml | grep mode
# mode: "" or mode: "iptables" = iptables mode
# mode: "ipvs" = IPVS mode (different, more efficient)

# IPVS mode: rules are in IPVS, not iptables
$ ipvsadm -Ln    # Shows IPVS service/destination table

Task: Rate Limiting with hashlimit

hashlimit tracks rates per source IP (or other criteria), unlike limit which is global.

> **Under the hood:** iptables rules are evaluated sequentially -- the kernel walks the chain from top to bottom for every packet. With 100 rules, this is negligible. With 10,000+ rules (common in large Kubernetes clusters), the linear scan measurably increases latency per packet. This is why kube-proxy offers IPVS mode: IPVS uses hash tables for O(1) lookup instead of O(n) chain traversal.
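To see where that linear cost concentrates on a real box, count rules per chain from a save-format dump. A sketch (the function name is illustrative); pipe iptables-save into it or pass a saved file:

```shell
# Rules per chain, largest first, from iptables-save output.
count_rules_per_chain() {
    awk '/^-A /{n[$2]++} END {for (c in n) print n[c], c}' "$@" | sort -rn
}

# Usage:
#   iptables-save | count_rules_per_chain
```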

# Limit each source IP to 10 SSH connections per minute
$ iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW \
    -m hashlimit --hashlimit-above 10/minute \
    --hashlimit-mode srcip \
    --hashlimit-name ssh_throttle \
    -j DROP

# Limit HTTP requests per source IP (basic DDoS mitigation)
$ iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate NEW \
    -m hashlimit --hashlimit-above 50/second \
    --hashlimit-mode srcip \
    --hashlimit-name http_throttle \
    --hashlimit-burst 100 \
    -j DROP

# Check hashlimit state (entries expire after --hashlimit-htable-expire, default 10s)
$ cat /proc/net/ipt_hashlimit/ssh_throttle
# Shows per-IP rate tracking entries

Task: Port Forwarding

Forward traffic arriving at your gateway to an internal server.

# Forward port 443 on the gateway to internal web server 10.0.0.50
$ echo 1 > /proc/sys/net/ipv4/ip_forward
$ iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 443 -j DNAT --to-destination 10.0.0.50:443
$ iptables -A FORWARD -p tcp -d 10.0.0.50 --dport 443 -m conntrack --ctstate NEW -j ACCEPT
$ iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Make ip_forward persistent
$ echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.d/99-forward.conf
$ sysctl -p /etc/sysctl.d/99-forward.conf

# Verify it's working
$ conntrack -L | grep 10.0.0.50

Same with nftables

$ nft add table ip nat
$ nft add chain ip nat prerouting { type nat hook prerouting priority -100 \; }
$ nft add rule ip nat prerouting iif eth0 tcp dport 443 dnat to 10.0.0.50:443
# ip_forward and the forward-path accept rules above are still required

Task: Blocking IP Ranges

# Block a single IP
$ iptables -I INPUT -s 203.0.113.50 -j DROP

# Block a CIDR range
$ iptables -I INPUT -s 198.51.100.0/24 -j DROP

# Block a country or large range with ipset (more efficient than many rules)
$ ipset create blocked_ips hash:net
$ ipset add blocked_ips 198.51.100.0/24
$ ipset add blocked_ips 203.0.113.0/24
$ ipset add blocked_ips 192.0.2.0/24
$ iptables -I INPUT -m set --match-set blocked_ips src -j DROP

# With nftables (built-in sets, no separate tool)
# Assumes a table "inet filter" with an input chain already exists:
$ nft add set inet filter blocked { type ipv4_addr \; flags interval \; }
$ nft add element inet filter blocked { 198.51.100.0/24, 203.0.113.0/24 }
$ nft insert rule inet filter input ip saddr @blocked drop

# Dynamically update the set (no rule changes needed)
$ nft add element inet filter blocked { 192.0.2.0/24 }

# List set contents
$ ipset list blocked_ips          # iptables + ipset
$ nft list set inet filter blocked  # nftables
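The nftables commands above can equally live in a declarative file and be loaded atomically with nft -f. A sketch (the filename is illustrative; policy accept is used here so the fragment is safe to load on its own):

```
# blocklist.nft -- load atomically with: nft -f blocklist.nft
table inet filter {
    set blocked {
        type ipv4_addr
        flags interval
        elements = { 198.51.100.0/24, 203.0.113.0/24 }
    }
    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr @blocked drop
    }
}
```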

Task: Auditing Current Rules

Before making changes, understand what you're working with.

# Full rule dump in save format (most useful for analysis)
$ iptables-save
$ ip6tables-save

# Count rules per chain
$ iptables-save | grep -c "^-A"
# 47 rules (if thousands, likely kube-proxy or Docker)

# Find rules matching a specific port
$ iptables-save | grep "dport 80"

# Find rules matching a specific IP
$ iptables-save | grep "10.0.0.50"

# Check chain policies (the defaults when no rule matches)
$ iptables -L | grep "Chain.*policy"
Chain INPUT (policy DROP)
Chain FORWARD (policy DROP)
Chain OUTPUT (policy ACCEPT)

# With nftables
$ nft list ruleset
$ nft list chain inet filter input

# Compare rules to saved file (find drift)
$ diff <(iptables-save | sort) <(sort /etc/iptables/rules.v4)

Task: Testing Rules Before Committing

Remember: iptables-restore and nft -f are atomic -- they swap the entire ruleset in one kernel operation. There is no window where "some old rules and some new rules" are active. Always prefer iptables-restore < rules.v4 over a series of iptables -A commands, which create a brief inconsistent state between each command.

# Test with a timeout: rules auto-revert after N seconds
# (Not built into iptables, but you can script it)

# Method 1: at job for safety
$ iptables-save > /tmp/rules.before
$ echo "iptables-restore < /tmp/rules.before" | at now + 3 minutes
# Make your changes...
# If you lock yourself out, wait 3 minutes.
# If everything works: atrm <job_id>

# Method 2: nftables has atomic replacement
$ nft -f /tmp/new_rules.nft    # Replaces the entire ruleset atomically
# If it fails (syntax error), old rules remain. No window of no-rules.

# Method 3: iptables-restore is also atomic
$ iptables-restore < /tmp/new_rules.v4
# All rules replaced at once. If the file has an error, nothing changes.
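For reference, a minimal rules.v4 in save format (a sketch of a default-deny input policy; adjust to your environment before loading):

```
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j ACCEPT
COMMIT
```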

Task: Monitoring Connection Tracking Table Health

Debug clue: If you see random connection failures under load but the server is not CPU or memory constrained, check the conntrack table. When nf_conntrack_count hits nf_conntrack_max, the kernel silently drops new connections. The only evidence is a dmesg line: nf_conntrack: table full, dropping packet.

# Current entries vs maximum
$ cat /proc/sys/net/netfilter/nf_conntrack_count
$ cat /proc/sys/net/netfilter/nf_conntrack_max

# If count approaches max, new connections are dropped silently!
# Increase the limit:
$ sysctl -w net.netfilter.nf_conntrack_max=262144

# Make persistent
$ echo "net.netfilter.nf_conntrack_max = 262144" >> /etc/sysctl.d/99-conntrack.conf

# Monitor conntrack table usage (for alerting)
$ paste /proc/sys/net/netfilter/nf_conntrack_count \
        /proc/sys/net/netfilter/nf_conntrack_max \
    | awk '{printf "%.1f%%\n", ($1/$2)*100}'
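For alerting, it helps to wrap that percentage in a function that also sets an exit status above a threshold (a sketch; conntrack_pct is an illustrative name):

```shell
# conntrack_pct COUNT MAX [THRESHOLD]: print usage %, exit nonzero
# when usage is at or above THRESHOLD (default 90).
conntrack_pct() {
    awk -v c="$1" -v m="$2" -v t="${3:-90}" \
        'BEGIN { p = c/m*100; printf "%.1f%%\n", p; exit (p >= t) }'
}

# Usage with live values (run as root):
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)" 90 \
#       || echo "ALERT: conntrack table nearly full"
```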

# See connections by state
$ conntrack -L 2>/dev/null | awk '{print $4}' | sort | uniq -c | sort -rn
  12543 ESTABLISHED
   2341 TIME_WAIT
    891 CLOSE_WAIT
     23 SYN_SENT

# Drop stale conntrack entries for a specific IP
$ conntrack -D -s 10.0.0.99