Advanced Bash for Ops - Street-Level Ops¶
Real-world patterns and debugging techniques for production Bash scripting.
Quick Diagnosis Commands¶
# Debug a script without modifying it
bash -x ./myscript.sh # Trace every command
bash -xv ./myscript.sh # Trace + print each line before expansion
PS4='+(${BASH_SOURCE}:${LINENO}): ' bash -x ./myscript.sh # Show file:line in trace
# Check syntax without running
bash -n ./myscript.sh
# Find which shell is actually running
readlink -f /proc/$$/exe
echo "${BASH_VERSION}"
# Profile script execution time
time ./myscript.sh
# Per-command timing via DEBUG trap
trap 'echo "$(date +%s.%N) $BASH_COMMAND"' DEBUG
Gotcha: Unquoted Variables¶
The single most common Bash bug in production:
# BROKEN: filename with spaces causes word splitting
file="/var/log/my app.log"
rm $file # Runs: rm /var/log/my app.log (two arguments!)
# FIXED
rm "${file}"
Rule: Always double-quote variable expansions. The main exceptions are inside [[ ]], inside arithmetic $(( )), and on the right side of assignments.
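A quick sketch of the rule and its exceptions (the path is illustrative):

```shell
file="/var/log/my app.log"
printf '<%s>\n' "${file}"    # quoted: one argument -> </var/log/my app.log>
printf '<%s>\n' ${file}      # unquoted: split into </var/log/my> and <app.log>
[[ $file == *" "* ]] && echo "contains a space"   # inside [[ ]]: no word splitting
copy=$file                   # right side of assignment: no splitting either
echo "${copy}"
```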
Gotcha: Subshells Swallowing Variables¶
Pipes create subshells. Variables set in a subshell don't propagate:
# BROKEN: count stays 0
count=0
cat hosts.txt | while read -r host; do
    count=$(( count + 1 ))
done
echo "${count}" # Still 0!
# FIXED: redirect instead of pipe
count=0
while read -r host; do
    count=$(( count + 1 ))
done < hosts.txt
echo "${count}" # Correct
Gotcha: set -e and Command Substitution¶
Under the hood:
local is itself a command, and its exit status masks that of the assignment. When Bash evaluates local result=$(failing_command), the failing command returns non-zero, but local succeeds (it created the variable), so set -e sees a zero exit status and continues. This is one of the most common set -e surprises.
set -e doesn't catch failures inside $() in all contexts:
# This DOES exit on failure: a plain assignment propagates the exit status
result=$(failing_command) # Exits due to set -e ✓
# This WON'T exit — assignment to local suppresses exit
local result=$(failing_command) # Bug! Exit code masked by 'local'
# FIXED: separate declaration and assignment
local result
result=$(failing_command)
Gotcha: Globbing in Variables¶
# DANGEROUS: variable contains glob chars
pattern="*.log"
rm ${pattern} # Expands to all .log files in CWD, not the literal string
# To operate on the literal string, quote it:
rm -- "${pattern}" # Quoted: no globbing; removes a file literally named *.log
# For controlled globbing:
shopt -s nullglob # Empty result instead of literal glob when no match
for f in /var/log/*.log; do
    echo "Processing: ${f}"
done
Pattern: Inventory-Driven Operations¶
# Parse a simple inventory file with groups
declare -A HOST_GROUPS
current_group="default"
while IFS= read -r line; do
    line="${line%%#*}"   # Strip comments
    line="${line// /}"   # Strip whitespace
    [[ -z "${line}" ]] && continue
    if [[ "${line}" == \[*\] ]]; then
        current_group="${line//[\[\]]/}"
        continue
    fi
    HOST_GROUPS["${current_group}"]+="${line} "
done < inventory.ini
# Use specific group
for host in ${HOST_GROUPS[webservers]}; do
ssh "${host}" 'uptime'
done
Pattern: Progress Reporting¶
total=${#HOSTS[@]}
completed=0
failed=0
report_progress() {
    local pct=$(( completed * 100 / total ))
    printf '\r[%3d%%] %d/%d done, %d failed' \
        "${pct}" "${completed}" "${total}" "${failed}"
}
for host in "${HOSTS[@]}"; do
    if ssh -o ConnectTimeout=5 "${host}" 'uptime' &>/dev/null; then
        (( completed++ ))
    else
        (( failed++ ))
        (( completed++ ))   # completed counts every processed host, including failures
        FAILED_HOSTS+=("${host}")
    fi
    report_progress
done
echo # Newline after progress
Pattern: Safe Temp Files¶
# Always use mktemp, never hardcoded paths
WORK_DIR=$(mktemp -d "${TMPDIR:-/tmp}/fleet-ops.XXXXXXXXXX")
trap 'rm -rf "${WORK_DIR}"' EXIT
# Per-host output files
for host in "${HOSTS[@]}"; do
    ssh "${host}" 'df -h' > "${WORK_DIR}/${host}.out" 2>&1
done
# Aggregate results
cat "${WORK_DIR}"/*.out | sort -k5 -rn | head -20
Pattern: SSH Multiplexing for Fleet Operations¶
# Set up a control socket for reuse
SSH_OPTS=(
    -o ConnectTimeout=5
    -o StrictHostKeyChecking=accept-new
    -o ControlMaster=auto
    -o ControlPath="/tmp/ssh-%r@%h:%p"
    -o ControlPersist=300
)
# First connection establishes the master
ssh "${SSH_OPTS[@]}" "${host}" 'uptime'
# Subsequent connections reuse the TCP connection (near-instant)
ssh "${SSH_OPTS[@]}" "${host}" 'free -m'
ssh "${SSH_OPTS[@]}" "${host}" 'df -h /'
Pattern: Rolling Operations with Circuit Breaker¶
MAX_FAILURES=3
failure_count=0
for host in "${HOSTS[@]}"; do
    if (( failure_count >= MAX_FAILURES )); then
        log ERROR "Circuit breaker: ${MAX_FAILURES} consecutive failures. Halting."
        log ERROR "Remaining hosts not processed: $(( ${#HOSTS[@]} - completed ))"
        exit ${E_PARTIAL}
    fi
    if process_host "${host}"; then
        failure_count=0   # Reset on success
    else
        (( failure_count++ ))
        FAILED_HOSTS+=("${host}")
        log WARN "Failure ${failure_count}/${MAX_FAILURES} on ${host}"
    fi
done
Emergency: Debugging a Hung Script¶
# Find the script process
pgrep -f myscript.sh
ps auxf | grep myscript
# See what it's doing right now
strace -p <PID> -e trace=network,write
cat /proc/<PID>/wchan # What kernel function it's waiting in
ls -la /proc/<PID>/fd/ # What files/sockets it has open
# See the script's environment
cat /proc/<PID>/environ | tr '\0' '\n'
# Send it a signal to trigger the trap handler
kill -USR1 <PID>
Emergency: Recovering from a Botched Fleet Script¶
# 1. Kill the runaway process
pkill -f fleet-patch.sh
# 2. Check which hosts were touched
grep 'Processing' /var/log/fleet-ops.log | tail -50
# 3. Identify partial state
for host in $(awk '/FAILED/ {print $NF}' /var/log/fleet-ops.log); do
    echo "--- ${host} ---"
    ssh "${host}" 'systemctl status nginx; rpm -q nginx' 2>&1
done
# 4. Clean up lock files
rm -rf /var/run/fleet-patch.lock
Useful One-Liners for Ops¶
# Find large files modified in last 24h
find / -xdev -type f -mtime -1 -size +100M -printf '%s %p\n' 2>/dev/null | sort -rn
# Watch for OOM kills in real-time
dmesg -wH | grep -i 'oom\|killed process'
# Quick disk usage sorted by size
du -xsh /* 2>/dev/null | sort -rh | head -15
# Check all systemd units in failed state
systemctl list-units --state=failed --no-legend --no-pager
# Parallel ping sweep
printf '%s\n' 10.0.1.{1..254} | xargs -P 50 -I{} sh -c 'ping -c1 -W1 {} >/dev/null 2>&1 && echo "{} up"'  # &> is a bashism; sh -c needs >/dev/null 2>&1
Power One-Liners¶
Bash tricks, shortcuts, and patterns that separate beginners from power users.
Re-run last command as root¶
Breakdown: !! is bash history expansion for "the entire previous command." Bash expands it before execution, so sudo wraps whatever you just ran. The expanded command is echoed before running.
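The trick itself, for interactive shells (history expansion is disabled in scripts); the failing command shown is illustrative:

```
systemctl restart nginx      # fails: permission denied
sudo !!                      # expands to: sudo systemctl restart nginx
```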
[!TIP] When to use: You ran a privileged command without sudo. Muscle memory saver.
Quick fix typo in previous command¶
Breakdown: Bash ^old^new substitution — shorthand for !!:s/old/new/. Only replaces the first occurrence. For global replace: !!:gs/old/new/.
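A sketch with an illustrative typo:

```
ssh web01.pord.internal      # oops
^pord^prod                   # re-runs: ssh web01.prod.internal
!!:gs/prod/staging/          # global-replace variant on the previous command
```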
[!TIP] When to use: Fat-fingered a path, hostname, or flag. Faster than up-arrow and editing.
Open $EDITOR for complex command composition¶
Breakdown: Readline shortcut that opens $EDITOR (or $VISUAL) with the current command line. Write your multi-line pipeline, save and quit — it executes. Works in bash and zsh.
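What the keystroke looks like in practice (angle brackets denote keys, not typed text):

```
for h in $(cat hosts.txt); do ...     # getting unwieldy
<Ctrl-x><Ctrl-e>                      # opens $EDITOR with the line; save-and-quit executes it
```

Related: the fc builtin opens the previous command in $EDITOR the same way.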
[!TIP] When to use: Building complex pipelines, writing inline awk scripts, anything longer than ~80 chars.
Cycle through previous command arguments¶
Breakdown: Inserts the last argument of the previous command. Press repeatedly to cycle through last arguments of earlier commands. ESC . works when ALT is captured by terminal emulator.
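A sketch with a hypothetical path:

```
mkdir -p /etc/myapp/conf.d
cd <Alt-.>                   # inserts /etc/myapp/conf.d
chown root: <Alt-.>          # same argument again; press repeatedly to cycle further back
```

In scripts, the special parameter $_ similarly expands to the last argument of the previous command.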
[!TIP] When to use: Reusing file paths, hostnames, or long arguments across sequential commands.
Kill/yank line for mid-command research¶
Breakdown: ctrl-u kills (cuts) from cursor to beginning of line into the kill ring. ctrl-y yanks (pastes) it back. The line is preserved even after running other commands in between.
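A sketch of the workflow (angle brackets denote keys):

```
rm -rf /var/cache/myapp      # hesitate: is that the right path?
<Ctrl-u>                     # line is cut to the kill ring
ls /var/cache                # go check
<Ctrl-y>                     # the rm line is back; finish or fix it
```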
[!TIP] When to use: Halfway through typing a command and need to check a path, PID, or hostname first.
Most frequently used commands¶
Breakdown: Classic awk frequency counter on field 2 (the command name from history output). Reveals your actual workflow patterns — useful for identifying alias candidates.
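A sketch of the counter; interactively you would pipe history into it, and here sample history lines stand in so the output shape is visible:

```shell
# Field 2 of `history` output is the command name; count, then rank by frequency.
# Interactive form: history | awk '{CMD[$2]++} END {for (c in CMD) print CMD[c], c}' | sort -rn | head
printf '  101  git status\n  102  ls -la\n  103  git push\n' |
    awk '{CMD[$2]++} END {for (c in CMD) print CMD[c], c}' |
    sort -rn
```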
[!TIP] When to use: Optimizing your shell workflow, creating aliases, identifying automation candidates.
Brace expansion for quick rename/backup¶
mv config.yaml{,.bak} # config.yaml -> config.yaml.bak
cp app.py{,.$(date +%s)} # timestamped backup
mkdir -p project/{src,tests,docs,scripts} # scaffold dirs
Breakdown: Bash expands {a,b} before the command runs. Empty element in {,.bak} means the original name is the first arg, .bak appended is the second. Works with any command.
[!TIP] When to use: Quick backups before editing, project scaffolding, batch file operations.
Process substitution for comparing outputs¶
diff <(sort file1) <(sort file2)
diff <(ssh host1 cat /etc/config) <(ssh host2 cat /etc/config)
diff <(cd dir1 && find | sort) <(cd dir2 && find | sort)
Breakdown: <(cmd) creates a temporary file descriptor containing the command's stdout. Lets you diff, comm, or paste outputs of two commands without temp files. Works in bash/zsh, not sh.
[!TIP] When to use: Comparing config across hosts, validating deployments, diffing directory trees.
Delete files NOT matching a pattern¶
Breakdown: extglob enables extended pattern matching. !(pattern) matches everything except the pattern. Without extglob, bash doesn't support negation in globs.
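A sketch in a scratch directory (the file names are illustrative):

```shell
shopt -s extglob             # must be enabled before the pattern is parsed
demo=$(mktemp -d)
cd "${demo}"
touch keep.txt notes.txt junk.log core.tmp
rm -- !(*.txt)               # delete everything EXCEPT *.txt
ls                           # only keep.txt and notes.txt remain
```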
[!TIP] When to use: Cleaning up a directory while preserving specific file types.
Quick directory stack navigation¶
pushd /var/log # push current dir, cd to /var/log
pushd /etc # push /var/log, cd to /etc
popd # back to /var/log
dirs -v # show numbered stack
Breakdown: pushd maintains a LIFO stack of directories. popd pops and cd's. dirs -v shows the stack with indices. cd ~2 jumps to stack position 2. Vastly superior to cd - for multi-dir workflows.
[!TIP] When to use: Bouncing between log dirs, config dirs, and source dirs during debugging.
Timestamped history for forensics¶
Breakdown: Sets history to record timestamps. %F = YYYY-MM-DD, %T = HH:MM:SS. Persists across sessions if added to .bashrc. Makes history output include when each command was run.
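The setting itself; add it to ~/.bashrc to persist (the sample timestamp in the comment is illustrative):

```shell
export HISTTIMEFORMAT='%F %T '
# `history` output gains a timestamp column, e.g.:  1012  2024-05-01 09:14:33  <command>
```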
[!TIP] When to use: Post-incident forensics — "what commands were run and when?" Also useful for compliance auditing.
Close shell keeping background jobs alive¶
Breakdown: disown -a removes ALL jobs from the shell's job table so they won't receive SIGHUP when the shell exits. Then exit closes cleanly. Alternative: start with nohup or use tmux/screen.
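A sketch, using sleep as a stand-in for the long-running job:

```shell
sleep 300 &      # stand-in for a build, transfer, or migration
bgpid=$!
disown -a        # remove ALL jobs from the job table: no SIGHUP at shell exit
jobs             # prints nothing; the shell no longer tracks the job
# exit           # now safe to close the shell
```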
[!TIP] When to use: Started a long-running process (build, transfer, migration) and need to disconnect.
Stream fan-out — one pipeline, many consumers¶
some_command | tee >(gzip > compressed.gz) >(sha256sum > checksum.txt) >(wc -l > linecount.txt) | head
Breakdown: tee copies stdin to each >(cmd) process substitution while still passing the original stream to stdout for the final pipeline stage. Each >(cmd) runs in its own subshell concurrently. This is shell dataflow programming — you're building a mini DAG in one line.
[!TIP] When to use: Simultaneously compress, hash, count, and preview a data stream. Also: writing to multiple log sinks, fan-out to parallel processors, splitting CI build artifacts.
Caveat: >(cmd) runs asynchronously — the main pipeline may finish before all branches complete. For critical writes, add wait after.
Compare state over time (diff as a time machine)¶
diff <(lsof -p 1234) <(sleep 10; lsof -p 1234)
diff <(ss -tn) <(sleep 30; ss -tn)
diff <(env) <(sleep 5; env)
Breakdown: First process substitution captures state now. Second captures state after a delay. diff shows what changed. Works with anything that produces text: ps, ss, netstat, iptables -L, mount, env.
[!TIP] When to use: "Something is leaking file descriptors / connections / memory." Snapshot, wait, snapshot, diff. The poor man's profiler.
Duplicate a disk to multiple targets simultaneously¶
Breakdown: dd reads the source once. tee + process substitution fans the stream to /dev/sdb while the main pipeline writes to /dev/sdc. One read pass, two write targets, concurrent.
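The fan-out itself, with illustrative device names (the source /dev/sda is an assumption; verify every device with lsblk before running anything like this):

```
dd if=/dev/sda bs=4M status=progress | tee >(dd of=/dev/sdb bs=4M) | dd of=/dev/sdc bs=4M
```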
[!TIP] When to use: Cloning a golden image to multiple drives, preparing identical servers.
Caveat: One typo in the of= targets and you're speedrunning regret. Triple-check device names with lsblk first.
Quick Reference¶
- Cheatsheet: Bash