Portal | Level: L1: Foundations | Topics: Bash / Shell Scripting, Linux Fundamentals | Domain: Linux

Advanced Bash for Ops - Primer

Why This Matters

Bash is the lingua franca of infrastructure. Every server has it. Every CI pipeline runs it. Every runbook assumes it. Most DevOps engineers can write a basic script, but production Bash — the kind that runs unattended at 3 AM across 1,500 servers — requires discipline that casual scripting never teaches. Badly written Bash is among the most common sources of self-inflicted outages in operations teams.

Core Principles

1. Defensive Defaults

Every production script should start with strict mode:

#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

| Flag | Effect |
| --- | --- |
| `-e` | Exit immediately when a command returns a non-zero status |
| `-u` | Treat expansion of an unset variable as an error |
| `-o pipefail` | A pipeline returns the exit status of the last failing command, rather than that of the last command in the pipeline |

Without these, a script can silently fail halfway through and continue executing destructive operations on stale state.
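
A minimal, runnable illustration of what `pipefail` changes, using only shell built-ins:

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's status is that of its LAST command,
# so an early failure is invisible to `set -e`.
set -e
false | cat
echo "survived: early pipeline failure went unnoticed"

# With pipefail, the same pipeline now reports the failure.
set -o pipefail
if false | cat; then
    echo "unreachable"
else
    echo "pipefail surfaced exit status $?"
fi
```

Run it and note that the first pipeline sails straight through despite `set -e`.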

Remember: Mnemonic for set -euo pipefail: "E-U-P: Exit on errors, Unset vars are errors, Pipes fail properly." Think of it as the safety harness for production Bash. ShellCheck complements it by flagging related hazards such as unquoted expansions (SC2086) and variables referenced but never assigned (SC2154).

2. Trap Handlers

Clean up resources when scripts exit — whether normally, on error, or on signal:

TMPDIR=$(mktemp -d)
LOCKFILE="/var/run/myprocess.lock"

cleanup() {
    rm -rf "${TMPDIR}"
    rm -f "${LOCKFILE}"
    echo "Cleaned up at $(date)" >> /var/log/myscript.log
}
trap cleanup EXIT

on_error() {
    local line=$1
    echo "ERROR: Script failed at line ${line}" >&2
    # Send alert, log to syslog, etc.
    logger -t myscript "FAILED at line ${line}"
}
# Add set -E (errtrace) if the ERR trap should also fire inside functions
# and subshells; without it, the trap is not inherited there.
trap 'on_error ${LINENO}' ERR

3. Lock Files

Prevent concurrent execution when scripts modify shared state:

LOCKFILE="/var/run/fleet-patch.lock"

acquire_lock() {
    if ! mkdir "${LOCKFILE}" 2>/dev/null; then
        local pid
        pid=$(cat "${LOCKFILE}/pid" 2>/dev/null || echo "unknown")
        echo "Lock held by PID ${pid}. Exiting." >&2
        exit 1
    fi
    echo $$ > "${LOCKFILE}/pid"
}

release_lock() {
    rm -rf "${LOCKFILE}"
}

acquire_lock
trap release_lock EXIT

Under the hood: mkdir is atomic on all POSIX filesystems because the kernel creates the directory in a single syscall that either succeeds or fails -- there's no race window. This makes it a reliable cross-platform lock primitive. The flock command uses the flock(2) syscall, which is kernel-level advisory locking -- faster and more robust, but not available on all systems (notably missing on some NFS mounts).

The mkdir approach works everywhere; prefer flock where it is available:

exec 200>/var/run/myscript.lock
flock -n 200 || { echo "Already running"; exit 1; }

Structured Logging

Production scripts need parseable output, not ad-hoc echo statements:

readonly LOG_FILE="/var/log/fleet-ops.log"
SCRIPT_NAME=$(basename "$0")
readonly SCRIPT_NAME   # assign before readonly so a failing substitution isn't masked (SC2155)

log() {
    local level=$1; shift
    local msg="$*"
    local ts
    ts=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
    printf '%s [%s] %s: %s\n' "${ts}" "${level}" "${SCRIPT_NAME}" "${msg}" | tee -a "${LOG_FILE}"
}

log INFO "Starting fleet patch cycle"
log WARN "Host db-03 unreachable, skipping"
log ERROR "Patch failed on web-12: exit code 137"

Argument Parsing

Use getopts for simple flags, or a manual loop for long options:

usage() {
    cat <<EOF
Usage: ${0##*/} [-n] [-v] [-t TIMEOUT] [-h] HOST_PATTERN
  -n          Dry run (no changes)
  -v          Verbose output
  -t TIMEOUT  SSH timeout in seconds (default: 10)
  -h          Show this help
EOF
    exit 1
}

DRY_RUN=false
VERBOSE=false
TIMEOUT=10

while getopts ":nvt:h" opt; do
    case ${opt} in
        n) DRY_RUN=true ;;
        v) VERBOSE=true ;;
        t) TIMEOUT=${OPTARG} ;;
        h) usage ;;
        :) echo "Option -${OPTARG} requires an argument" >&2; usage ;;
        \?) echo "Unknown option -${OPTARG}" >&2; usage ;;
    esac
done
shift $((OPTIND - 1))

[[ $# -lt 1 ]] && { echo "HOST_PATTERN required" >&2; usage; }
HOST_PATTERN=$1

Arrays and Iteration

Bash arrays are essential for handling lists of hosts, files, or arguments safely:

# Declare arrays
declare -a HOSTS=()
declare -a FAILED=()
declare -a SKIPPED=()

# Build host list from inventory
while IFS= read -r host; do
    [[ -z "${host}" || "${host}" == \#* ]] && continue
    HOSTS+=("${host}")
done < /etc/fleet/inventory.txt

# Iterate with index
for i in "${!HOSTS[@]}"; do
    host="${HOSTS[$i]}"
    echo "[$(( i + 1 ))/${#HOSTS[@]}] Processing ${host}..."
done

# Report
echo "Total: ${#HOSTS[@]}  Failed: ${#FAILED[@]}  Skipped: ${#SKIPPED[@]}"

Process Substitution and File Descriptors

# Compare two remote file listings without temp files
diff <(ssh host1 'ls /etc/configs/') <(ssh host2 'ls /etc/configs/')

# Redirect stdout and stderr to different files
exec 1>>/var/log/myscript.out
exec 2>>/var/log/myscript.err

# Tee to both log and stdout using fd 3
exec 3>&1
exec 1> >(tee -a /var/log/myscript.log >&3)

String Manipulation

Bash built-in string operations avoid forking to sed/awk for simple tasks:

# Parameter expansion
filename="/path/to/config.yaml.bak"
echo "${filename##*/}"        # config.yaml.bak  (basename)
echo "${filename%.*}"         # /path/to/config.yaml  (remove extension)
echo "${filename%%.*}"        # /path/to/config  (remove all extensions)
echo "${filename/yaml/json}"  # /path/to/config.json.bak  (substitution)

# Default values
DB_HOST="${DB_HOST:-localhost}"
DB_PORT="${DB_PORT:=5432}"     # Also assigns if unset

# Length
echo "${#filename}"            # 24

Exit Codes as API

Define meaningful exit codes so callers can react programmatically:

readonly E_SUCCESS=0
readonly E_USAGE=1
readonly E_LOCK=2
readonly E_SSH=3
readonly E_TIMEOUT=4
readonly E_PARTIAL=5    # Some hosts succeeded, some failed

main() {
    # ... script logic ...
    if [[ ${#FAILED[@]} -gt 0 && ${#SUCCEEDED[@]} -gt 0 ]]; then
        exit ${E_PARTIAL}
    elif [[ ${#FAILED[@]} -gt 0 ]]; then
        exit ${E_SSH}
    fi
    exit ${E_SUCCESS}
}

Common Patterns

Retry with Backoff

retry() {
    local max_attempts=$1; shift
    local delay=$1; shift
    local attempt=1

    while (( attempt <= max_attempts )); do
        if "$@"; then
            return 0
        fi
        echo "Attempt ${attempt}/${max_attempts} failed. Retrying in ${delay}s..." >&2
        sleep "${delay}"
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
    return 1
}

retry 3 5 ssh "${host}" 'systemctl restart nginx'

Parallel Execution with Controlled Concurrency

MAX_PARALLEL=10

run_parallel() {
    local -a pids=()
    for host in "${HOSTS[@]}"; do
        process_host "${host}" &
        pids+=($!)

        # Throttle
        if (( ${#pids[@]} >= MAX_PARALLEL )); then
            wait -n  # Wait for any one to finish (bash 4.3+)
            # Clean up finished pids
            local -a active=()
            for pid in "${pids[@]}"; do
                kill -0 "${pid}" 2>/dev/null && active+=("${pid}")
            done
            pids=("${active[@]}")
        fi
    done
    wait  # Wait for remaining
}

Here-Doc for Remote Commands

ssh -o ConnectTimeout=10 "${host}" bash -s <<'REMOTE'
    set -euo pipefail
    echo "Running on $(hostname)"
    systemctl status nginx
    df -h /var/log
REMOTE

Note: single-quoting 'REMOTE' prevents local variable expansion. Remove quotes to allow it.
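
A quick local demonstration of the quoting rule, using `bash -s` as a stand-in for the remote shell:

```shell
#!/usr/bin/env bash
set -euo pipefail

NAME="parent-value"

# Quoted delimiter: the body is passed verbatim; the child shell expands $NAME.
quoted=$(bash -s <<'EOF'
NAME="child-value"
echo "$NAME"
EOF
)

# Unquoted delimiter: $NAME expands locally, before the child shell runs.
unquoted=$(bash -s <<EOF
echo "$NAME"
EOF
)

echo "quoted:   ${quoted}"     # child-value
echo "unquoted: ${unquoted}"   # parent-value
```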

Gotcha: set -e has subtle exceptions. A bare assignment result=$(failing_command) does trigger the set -e exit, but local result=$(failing_command) does not, because local's own exit status (0) masks the substitution's. failing_command | other_command will not exit unless set -o pipefail is also in effect. And if failing_command; then never exits, because the command runs in a condition context (the same applies to the left-hand side of && and ||). These exceptions trip up even experienced scripters.
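
These exceptions can be demonstrated directly; a related wrinkle worth knowing is that `local var=$(cmd)` masks failures because `local`'s own exit status wins:

```shell
#!/usr/bin/env bash
set -eu   # deliberately no pipefail, to show the pipeline exception

# A failure assigned via `local` is masked: `local` itself returns 0.
demo_local() {
    local out=$(false)
    echo "after local: still running"
}
demo_local

# A failing left-hand side of a pipe is ignored without pipefail.
false | cat
echo "after pipeline: still running"

# A failure in a condition context never triggers set -e.
if false; then
    echo "unreachable"
fi
echo "after if: still running"

# For contrast, a bare assignment DOES propagate the failure
# (run in a child bash so this script itself survives).
bash -c 'set -e; out=$(false); echo unreachable' ||
    echo "bare assignment: child exited with $?"
```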

Testing Bash Scripts

ShellCheck

Always run shellcheck on production scripts:

shellcheck -s bash myscript.sh

ShellCheck catches quoting issues, unused variables, POSIX compatibility problems, and common logic errors.
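
For example, the unquoted expansion that SC2086 flags causes word splitting, which is easy to demonstrate:

```shell
#!/usr/bin/env bash
set -uo pipefail

count_args() { echo "$#"; }

f="my report.txt"   # a value containing a space

count_args $f       # unquoted (SC2086): splits into 2 arguments
count_args "$f"     # quoted: stays one argument
```

With a filename argument to `rm` or `mv`, that same split turns one intended operand into two unintended ones.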

BATS (Bash Automated Testing)

#!/usr/bin/env bats

@test "lock file prevents concurrent runs" {
    mkdir /var/run/fleet-patch.lock
    run ./fleet-patch.sh
    [ "$status" -eq 2 ]
    [[ "$output" == *"Lock held"* ]]
    rmdir /var/run/fleet-patch.lock
}

@test "dry run makes no changes" {
    run ./fleet-patch.sh -n web-01
    [ "$status" -eq 0 ]
    [[ "$output" == *"DRY RUN"* ]]
}

The Deeper Patterns Behind Power One-Liners

Powerful one-liners are not about memorizing syntax — they are about recognizing composable patterns. Once you see these patterns, you can improvise solutions to problems you have never encountered before.

| Pattern | What it means | Example |
| --- | --- | --- |
| Process substitution | Treat command output like a file | `diff <(sort a) <(sort b)` |
| Stream fan-out | One stream, many consumers | `tee >(cmd1) >(cmd2)` |
| State normalization | Convert messy reality into plain text | `find \| sort`, `lsof`, `ps`, `ss` |
| Zero-temp-file transport | Stream instead of save-then-move | `tar \| ssh \| tar` |
| Incremental narrowing | Cheap filter before expensive work | Size-first duplicate finder |
| Time-sliced inspection | Compare now vs. later | `diff <(cmd) <(sleep 10; cmd)` |
| FIFO loop closure | Named pipe turns linear pipes into circuits | `mkfifo \| nc \| tee \| nc > fifo` |

The key insight: the shell stops being a "command runner" and starts acting like a tiny dataflow language. Every power one-liner is really converting a problem into text streams, then composing tiny tools. That is the Unix philosophy in one sentence.
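
As one worked example, "incremental narrowing" can be sketched as a size-first duplicate finder (a sketch assuming GNU find and coreutils; `DIR` is a hypothetical target directory):

```shell
#!/usr/bin/env bash
set -euo pipefail

DIR="${1:-.}"

# Stage 1 (cheap): file sizes that occur more than once.
dup_sizes=$(find "$DIR" -type f -printf '%s\n' | sort -n | uniq -d)

# Stage 2 (expensive, but narrowed): checksum only the candidates that
# shared a size, then group identical hashes (first 32 chars of md5sum).
for size in ${dup_sizes}; do
    find "$DIR" -type f -size "${size}c" -exec md5sum {} +
done | sort | uniq -w32 --all-repeated=separate
```

Files with a unique size never get hashed at all, which is where the savings come from on large trees.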

Name origin: Bash stands for "Bourne Again SHell" -- a pun on "born again" and the Bourne Shell (sh) created by Stephen Bourne at Bell Labs in 1979. Brian Fox wrote Bash for the GNU Project in 1989. The current maintainer is Chet Ramey, who has maintained it since the early 1990s.

