Portal | Level: L1: Foundations | Topics: Bash / Shell Scripting, Linux Fundamentals | Domain: Linux
Advanced Bash for Ops - Primer¶
Why This Matters¶
Bash is the lingua franca of infrastructure. Every server has it. Every CI pipeline runs it. Every runbook assumes it. Most DevOps engineers can write a basic script, but production Bash — the kind that runs unattended at 3 AM across 1,500 servers — requires discipline that casual scripting never teaches. Careless Bash is among the most common sources of self-inflicted outages in operations teams.
Core Principles¶
1. Defensive Defaults¶
Every production script should start with strict mode:
| Flag | Effect |
|---|---|
| `-e` | Exit immediately on non-zero return |
| `-u` | Treat unset variables as errors |
| `-o pipefail` | Pipe fails if any command in the chain fails |
Without these, a script can silently fail halfway through and continue executing destructive operations on stale state.
Remember: Mnemonic for `set -euo pipefail`: "E-U-P: Exit on errors, Unset vars are errors, Pipes fail properly." Think of it as the safety harness for production Bash. ShellCheck complements it by flagging related mistakes (e.g., SC2086 for unquoted expansions, SC2154 for variables referenced but never assigned).
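A minimal, safe-to-run sketch of what strict mode changes (the `DEPLOY_ENV` variable is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# -u: expanding an unset variable is now a hard error instead of a
#     silent empty string, so spell out legitimate defaults explicitly
echo "deploy env: ${DEPLOY_ENV:-staging}"

# pipefail: the pipeline's exit status reflects the first failing
# stage, so a dead producer can't hide behind a successful consumer
printf 'web-01\nweb-02\n' | wc -l

# -e: any uncaught non-zero exit above would have aborted the script
# at that line instead of letting it continue on stale state
```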
2. Trap Handlers¶
Clean up resources when scripts exit — whether normally, on error, or on signal:
TMPDIR=$(mktemp -d)
LOCKFILE="/var/run/myprocess.lock"

cleanup() {
    rm -rf "${TMPDIR}"
    rm -f "${LOCKFILE}"
    echo "Cleaned up at $(date)" >> /var/log/myscript.log
}
trap cleanup EXIT

on_error() {
    local line=$1
    echo "ERROR: Script failed at line ${line}" >&2
    # Send alert, log to syslog, etc.
    logger -t myscript "FAILED at line ${line}"
}
# Note: also `set -E` (errtrace) so the ERR trap fires inside functions
# and command substitutions, not just at the top level.
trap 'on_error ${LINENO}' ERR
3. Lock Files¶
Prevent concurrent execution when scripts modify shared state:
LOCKFILE="/var/run/fleet-patch.lock"

acquire_lock() {
    if ! mkdir "${LOCKFILE}" 2>/dev/null; then
        local pid
        pid=$(cat "${LOCKFILE}/pid" 2>/dev/null || echo "unknown")
        echo "Lock held by PID ${pid}. Exiting." >&2
        exit 1
    fi
    echo $$ > "${LOCKFILE}/pid"
}

release_lock() {
    rm -rf "${LOCKFILE}"
}

acquire_lock
trap release_lock EXIT
Under the hood:
`mkdir` is atomic on all POSIX filesystems because the kernel creates the directory in a single syscall that either succeeds or fails -- there's no race window. This makes it a reliable cross-platform lock primitive. The `flock` command uses the `flock(2)` syscall, which is kernel-level advisory locking -- faster and more robust, but not available on all systems (notably missing on some NFS mounts).
The `mkdir` approach works everywhere; `flock` is the better choice when available.
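A minimal `flock`-based sketch, under the assumption that `flock(1)` (util-linux) is installed; the lock path and file descriptor number are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative lock path; production scripts typically use /var/run
LOCKFILE="${TMPDIR:-/tmp}/fleet-patch.lock"

# Open the lock file on a dedicated file descriptor...
exec 9>"${LOCKFILE}"

# ...then try to take an exclusive lock without blocking
if ! flock -n 9; then
    echo "Lock held by another process. Exiting." >&2
    exit 1
fi

echo "lock acquired"
# Critical section here. The kernel releases the lock when fd 9 closes,
# even if the script is killed, so there is no stale lock to clean up.
```

Unlike the `mkdir` scheme, a crashed script cannot leave a stale lock behind, because the lock dies with the process.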
Structured Logging¶
Production scripts need parseable output, not ad-hoc echo statements:
readonly LOG_FILE="/var/log/fleet-ops.log"
SCRIPT_NAME=$(basename "$0")
readonly SCRIPT_NAME  # assign, then mark readonly, so a failed substitution isn't masked (ShellCheck SC2155)

log() {
    local level=$1; shift
    local msg="$*"
    local ts
    ts=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
    printf '%s [%s] %s: %s\n' "${ts}" "${level}" "${SCRIPT_NAME}" "${msg}" | tee -a "${LOG_FILE}"
}
log INFO "Starting fleet patch cycle"
log WARN "Host db-03 unreachable, skipping"
log ERROR "Patch failed on web-12: exit code 137"
Argument Parsing¶
Use getopts for simple flags, or a manual loop for long options:
usage() {
    cat <<EOF
Usage: ${0##*/} [-n] [-v] [-t TIMEOUT] [-h] HOST_PATTERN
  -n           Dry run (no changes)
  -v           Verbose output
  -t TIMEOUT   SSH timeout in seconds (default: 10)
  -h           Show this help
EOF
    exit 1
}
DRY_RUN=false
VERBOSE=false
TIMEOUT=10
while getopts ":nvt:h" opt; do
    case ${opt} in
        n) DRY_RUN=true ;;
        v) VERBOSE=true ;;
        t) TIMEOUT=${OPTARG} ;;
        h) usage ;;
        :) echo "Option -${OPTARG} requires an argument" >&2; usage ;;
        \?) echo "Unknown option -${OPTARG}" >&2; usage ;;
    esac
done
shift $((OPTIND - 1))
[[ $# -lt 1 ]] && { echo "HOST_PATTERN required" >&2; usage; }
HOST_PATTERN=$1
Arrays and Iteration¶
Bash arrays are essential for handling lists of hosts, files, or arguments safely:
# Declare arrays
declare -a HOSTS=()
declare -a FAILED=()
declare -a SKIPPED=()

# Build host list from inventory, skipping blanks and comments
while IFS= read -r host; do
    [[ -z "${host}" || "${host}" == \#* ]] && continue
    HOSTS+=("${host}")
done < /etc/fleet/inventory.txt

# Iterate with index (indices are zero-based, hence the +1 for display)
for i in "${!HOSTS[@]}"; do
    host="${HOSTS[$i]}"
    echo "[$((i + 1))/${#HOSTS[@]}] Processing ${host}..."
done

# Report
echo "Total: ${#HOSTS[@]} Failed: ${#FAILED[@]} Skipped: ${#SKIPPED[@]}"
Process Substitution and File Descriptors¶
# Compare two remote file listings without temp files
diff <(ssh host1 'ls /etc/configs/') <(ssh host2 'ls /etc/configs/')
# Redirect stdout and stderr to different files
exec 1>>/var/log/myscript.out
exec 2>>/var/log/myscript.err
# Tee to both log and stdout using fd 3
exec 3>&1
exec 1> >(tee -a /var/log/myscript.log >&3)
String Manipulation¶
Bash built-in string operations avoid forking to sed/awk for simple tasks:
# Parameter expansion
filename="/path/to/config.yaml.bak"
echo "${filename##*/}" # config.yaml.bak (basename)
echo "${filename%.*}" # /path/to/config.yaml (remove extension)
echo "${filename%%.*}" # /path/to/config (remove all extensions)
echo "${filename/yaml/json}" # /path/to/config.json.bak (substitution)
# Default values
DB_HOST="${DB_HOST:-localhost}"
DB_PORT="${DB_PORT:=5432}" # Also assigns if unset
# Length
echo "${#filename}" # 24
Exit Codes as API¶
Define meaningful exit codes so callers can react programmatically:
readonly E_SUCCESS=0
readonly E_USAGE=1
readonly E_LOCK=2
readonly E_SSH=3
readonly E_TIMEOUT=4
readonly E_PARTIAL=5 # Some hosts succeeded, some failed
main() {
    # ... script logic populates the SUCCEEDED and FAILED arrays ...
    if [[ ${#FAILED[@]} -gt 0 && ${#SUCCEEDED[@]} -gt 0 ]]; then
        exit ${E_PARTIAL}
    elif [[ ${#FAILED[@]} -gt 0 ]]; then
        exit ${E_SSH}
    fi
    exit ${E_SUCCESS}
}
Common Patterns¶
Retry with Backoff¶
retry() {
    local max_attempts=$1; shift
    local delay=$1; shift
    local attempt=1
    while (( attempt <= max_attempts )); do
        if "$@"; then
            return 0
        fi
        if (( attempt == max_attempts )); then
            break   # don't sleep or announce a retry after the last attempt
        fi
        echo "Attempt ${attempt}/${max_attempts} failed. Retrying in ${delay}s..." >&2
        sleep "${delay}"
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
    return 1
}
retry 3 5 ssh "${host}" 'systemctl restart nginx'
Parallel Execution with Controlled Concurrency¶
MAX_PARALLEL=10

run_parallel() {
    local -a pids=()
    for host in "${HOSTS[@]}"; do
        process_host "${host}" &
        pids+=($!)
        # Throttle: once the window is full, wait for any job to finish
        if (( ${#pids[@]} >= MAX_PARALLEL )); then
            wait -n   # Wait for any one to finish (bash 4.3+)
            # Drop PIDs that are no longer running
            local -a active=()
            for pid in "${pids[@]}"; do
                kill -0 "${pid}" 2>/dev/null && active+=("${pid}")
            done
            pids=("${active[@]}")
        fi
    done
    wait   # Wait for the remainder
}
Here-Doc for Remote Commands¶
ssh -o ConnectTimeout=10 "${host}" bash -s <<'REMOTE'
set -euo pipefail
echo "Running on $(hostname)"
systemctl status nginx
df -h /var/log
REMOTE
Note: single-quoting `REMOTE` prevents local variable expansion; remove the quotes to expand local variables before the script is sent.
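The quoting rule is easy to verify locally (the `region` variable is illustrative):

```shell
#!/usr/bin/env bash
region="us-east-1"

# Quoted delimiter: the body is sent verbatim, so $region is left
# for whoever reads the text (e.g., the remote shell) to expand
cat <<'EOF'
quoted:   $region
EOF

# Unquoted delimiter: $region is expanded locally, before sending
cat <<EOF
unquoted: $region
EOF
```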
Gotcha:
`set -e` has a subtle trap with subshells and command substitution. A plain assignment `result=$(failing_command)` will trigger the `set -e` exit (though `local result=$(failing_command)` will not, because `local`'s own exit status masks the substitution's). But `failing_command | other_command` will NOT exit (unless `set -o pipefail` is also set), and `if failing_command; then` will NOT exit because the command is in a condition context. These exceptions trip even experienced scripters.
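These exceptions are safe to demonstrate locally (nothing below touches real systems):

```shell
#!/usr/bin/env bash
set -e   # deliberately no pipefail, to show the pipeline exception

# Exception 1: a failing left-hand pipeline stage does not abort,
# because the pipeline's status is that of its LAST command
false | true
echo "survived the failing pipeline"

# Exception 2: failure in a condition context is "checked", not fatal
if false; then
    echo "never reached"
fi
echo "survived the if-condition"

# By contrast, a plain assignment from command substitution DOES count:
# uncommenting the next line would abort the script right here
# result=$(false)
```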
Testing Bash Scripts¶
ShellCheck¶
Always run `shellcheck` on production scripts; a plain `shellcheck myscript.sh` is enough to start.
ShellCheck catches quoting issues, unused variables, POSIX compatibility problems, and common logic errors.
BATS (Bash Automated Testing)¶
#!/usr/bin/env bats

@test "lock file prevents concurrent runs" {
    mkdir /var/run/fleet-patch.lock
    run ./fleet-patch.sh
    [ "$status" -eq 2 ]
    [[ "$output" == *"Lock held"* ]]
    rmdir /var/run/fleet-patch.lock
}

@test "dry run makes no changes" {
    run ./fleet-patch.sh -n web-01
    [ "$status" -eq 0 ]
    [[ "$output" == *"DRY RUN"* ]]
}
The Deeper Patterns Behind Power One-Liners¶
Powerful one-liners are not about memorizing syntax — they are about recognizing composable patterns. Once you see the patterns, you can improvise solutions to problems you have never seen before.
| Pattern | What it means | Example |
|---|---|---|
| Process substitution | Treat command output like a file | diff <(sort a) <(sort b) |
| Stream fan-out | One stream, many consumers | tee >(cmd1) >(cmd2) |
| State normalization | Convert messy reality into plain text | find \| sort, lsof, ps, ss |
| Zero-temp-file transport | Stream instead of save-then-move | tar \| ssh \| tar |
| Incremental narrowing | Cheap filter before expensive work | Size-first duplicate finder |
| Time-sliced inspection | Compare now vs later | diff <(cmd) <(sleep 10; cmd) |
| FIFO loop closure | Named pipe turns linear pipes into circuits | mkfifo \| nc \| tee \| nc > fifo |
The key insight: the shell stops being "command runner" and starts acting like a tiny dataflow language. Every power one-liner is really converting a problem into text streams, then composing tiny tools. That is the Unix philosophy in one sentence.
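As a concrete instance of the first row — treating command output like a file — here is a runnable sketch (the file contents are made up for the demo):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Two unsorted "inventories" standing in for real command output
old=$(mktemp)
new=$(mktemp)
printf 'web-02\nweb-01\n' > "$old"
printf 'web-01\nweb-03\nweb-02\n' > "$new"

# Process substitution: each <(...) becomes a readable pseudo-file,
# so diff compares the sorted streams with no named temp files
diff <(sort "$old") <(sort "$new") || true   # diff exits 1 on differences

rm -f "$old" "$new"
```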
Name origin: Bash stands for "Bourne Again SHell" -- a pun on "born again" and the Bourne shell (`sh`) created by Stephen Bourne at Bell Labs in 1979. Brian Fox wrote Bash for the GNU Project in 1989, and Chet Ramey has maintained it since the early 1990s.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Next Steps¶
- Regex & Text Wrangling (Topic Pack, L1)
Related Content¶
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Bash / Shell Scripting, Linux Fundamentals
- Environment Variables (Topic Pack, L1) — Bash / Shell Scripting, Linux Fundamentals
- LPIC / LFCS Exam Preparation (Topic Pack, L2) — Bash / Shell Scripting, Linux Fundamentals
- Linux Ops (Topic Pack, L0) — Bash / Shell Scripting, Linux Fundamentals
- Linux Ops Drills (Drill, L0) — Bash / Shell Scripting, Linux Fundamentals
- Pipes & Redirection (Topic Pack, L1) — Bash / Shell Scripting, Linux Fundamentals
- Process Management (Topic Pack, L1) — Bash / Shell Scripting, Linux Fundamentals
- RHCE (EX294) Exam Preparation (Topic Pack, L2) — Bash / Shell Scripting, Linux Fundamentals
- Regex & Text Wrangling (Topic Pack, L1) — Bash / Shell Scripting, Linux Fundamentals
- Track: Foundations (Reference, L0) — Bash / Shell Scripting, Linux Fundamentals