Portal | Level: L1: Foundations | Topics: awk, Bash / Shell Scripting | Domain: CLI Tools
awk — The Record/Field Processor Primer¶
Why This Matters¶
awk is the Unix power tool for structured text. Where sed thinks in lines and substitutions, awk thinks in records and fields — it sees every line as a data row and every word as a column. It has variables, arrays, arithmetic, control flow, and printf formatting. It is a small programming language disguised as a command-line text processor.
Every log file, CSV, TSV, command output, and config file you encounter in operations is structured text that awk can parse, filter, aggregate, and reshape in a single pass. If you cannot reach for awk when you need to sum a column, count occurrences, or reformat fields, you are writing throwaway scripts for problems awk solves in one line.
Fun fact: awk was created in 1977 by Alfred Aho, Peter Weinberger, and Brian Kernighan at Bell Labs; the name is their initials. It was one of the first data-driven programming languages. The gawk (GNU awk) variant is the default on most Linux distributions and adds features like network I/O and multidimensional arrays. mawk is a faster but less feature-rich alternative often found on Debian/Ubuntu.
Remember: awk's key advantage over grep/sed is that it understands columns. While grep finds lines and sed transforms text, awk splits each line into fields ($1, $2, ..., $NF) and lets you compute, filter, and aggregate by field. The command awk '{sum+=$3} END {print sum}' sums the third column of an entire file; try doing that with grep or sed.
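To see that column-sum claim in action on sample data (supplied inline here via printf; any whitespace-separated file works the same way):

```shell
# Sum the third column of two sample rows: 10 + 20
printf '1 a 10\n2 b 20\n' | awk '{ sum += $3 } END { print sum }'
# prints 30
```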
Mental Model¶
awk sees every line as a record and every whitespace-separated word as a field. It processes input record by record, applying pattern-action rules to each one.
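A minimal demonstration of the record/field model, using inline sample input:

```shell
# NR is the record (line) number, NF the field count, $1 the first field
printf 'alice 42\nbob 7\n' | awk '{ print "record", NR, "has", NF, "fields; first is", $1 }'
# record 1 has 2 fields; first is alice
# record 2 has 2 fields; first is bob
```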
The Pattern-Action Structure¶
Every awk program is a set of pattern { action } rules:
# Print every line (default action is print)
awk '{ print }' file.txt
# Print lines matching a pattern
awk '/ERROR/ { print }' file.txt
# Print lines where field 3 is greater than 100
awk '$3 > 100 { print }' file.txt
# Print field 1 and field 3 from every line
awk '{ print $1, $3 }' file.txt
Built-In Variables¶
| Variable | Meaning |
|---|---|
| $0 | Entire current record (line) |
| $1, $2, ... | Fields 1, 2, etc. |
| NR | Number of records read so far (line number) |
| NF | Number of fields in the current record |
| FS | Input field separator (default: whitespace) |
| OFS | Output field separator (default: space) |
| RS | Input record separator (default: newline) |
| ORS | Output record separator (default: newline) |
| FILENAME | Name of the current input file |
| FNR | Record number in the current file |
# Print line number and line
awk '{ print NR, $0 }' file.txt
# Print last field of every line
awk '{ print $NF }' file.txt
# Print lines with more than 5 fields
awk 'NF > 5' file.txt
# Print second-to-last field
awk '{ print $(NF-1) }' file.txt
Field Separator¶
# Colon-separated (like /etc/passwd)
awk -F: '{ print $1, $3 }' /etc/passwd
# CSV (naive — does not handle quoted commas)
awk -F, '{ print $2 }' data.csv
# Multiple characters as separator
awk -F'::' '{ print $1 }' file.txt
# Set FS in BEGIN block
awk 'BEGIN { FS=":" } { print $1 }' /etc/passwd
# Tab-separated output
awk -F: 'BEGIN { OFS="\t" } { print $1, $3, $7 }' /etc/passwd
BEGIN and END Blocks¶
# Print a header, process lines, print a footer
awk 'BEGIN { print "User\tUID\tShell" }
{ print $1, $3, $7 }
END { print "---\nTotal:", NR, "users" }' FS=: OFS='\t' /etc/passwd
# Sum a column
awk '{ sum += $3 } END { print "Total:", sum }' data.txt
# Count lines matching a pattern
awk '/ERROR/ { count++ } END { print count, "errors" }' logfile.txt
One-liner: The awk '!seen[$0]++' deduplication idiom is one of the most elegant awk patterns. It works because: 1) seen[$0] is initially 0 (false) for an unseen line, 2) the ! negates it to true, so awk prints the line, 3) ++ increments the count to 1, so next time !seen[$0] is false and the duplicate is skipped. All in 11 characters.
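The idiom is easy to verify on inline sample input:

```shell
# Duplicates are dropped; first occurrences keep their original order
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# prints a, b, c (one per line)
```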
Printf Formatting¶
# Formatted output (like C printf)
awk -F: '{ printf "%-20s %5d %s\n", $1, $3, $7 }' /etc/passwd
# Right-justified numbers
awk '{ printf "%10.2f\n", $1 }' numbers.txt
# Output with padding
awk '{ printf "%05d %s\n", NR, $0 }' file.txt
Associative Arrays¶
awk arrays are associative (like dictionaries/hashmaps). Indices can be any string.
# Count occurrences of each value in column 1
awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' access.log
# Sum values grouped by a key
awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' sales.txt
# Deduplicate lines (preserving first occurrence)
awk '!seen[$0]++' file.txt
# Find most frequent value
awk '{ count[$1]++ }
END { for (k in count) if (count[k] > max) { max=count[k]; val=k }
print val, max }' data.txt
String Functions¶
# length — string length
awk '{ print length($0) }' file.txt
# substr — extract substring
awk '{ print substr($0, 1, 10) }' file.txt
# index — find position of substring
awk '{ pos = index($0, "error"); if (pos) print NR, pos }' file.txt
# split — split string into array
awk '{ n = split($1, parts, "."); print parts[1], parts[n] }' hostnames.txt
# sub — replace first match
awk '{ sub(/http:/, "https:"); print }' urls.txt
# gsub — replace all matches
awk '{ gsub(/[[:space:]]+/, " "); print }' messy.txt
# match — regex match with position/length
awk '{ if (match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/))
print substr($0, RSTART, RLENGTH) }' logfile.txt
# tolower / toupper
awk '{ print toupper($1) }' file.txt
Math Operations¶
# Average
awk '{ sum += $1; count++ } END { print sum/count }' numbers.txt
# Min and max
awk 'NR==1 { min=$1; max=$1 }
$1 < min { min=$1 }
$1 > max { max=$1 }
END { print "min:", min, "max:", max }' numbers.txt
# Percentages (store each row, then compute shares once the total is known)
awk '{ keys[NR]=$1; vals[NR]=$2; total += $2 }
END { for (i=1; i<=NR; i++) printf "%s %.1f%%\n", keys[i], (vals[i]/total)*100 }' file.txt
# Running total
awk '{ sum += $1; print $0, sum }' numbers.txt
# Random sampling (approximately 10% of lines)
awk 'BEGIN { srand() } rand() < 0.1' bigfile.txt
Gotcha: awk's default field separator is not just space; it is "one or more whitespace characters" (spaces and tabs). This means leading whitespace is consumed: echo "  hello world" | awk '{print $1}' prints hello, not an empty string. Note that setting FS to a single space (FS=" ") is awk's special default and behaves the same way; to preserve or detect leading whitespace, set FS to a regex such as "[ ]" so every single space acts as a separator.
Debug clue: When an awk script produces unexpected output on a field, use awk '{print NF, "|" $0 "|"}' to inspect the actual number of fields and the raw line. Invisible characters (carriage returns from Windows files, non-breaking spaces from web copy-paste) silently corrupt field splitting. Run cat -A file.txt to reveal hidden characters; ^M at line ends means Windows \r\n line endings.
Under the hood: awk processes input in a single pass, reading one record at a time, which makes it memory-efficient even on multi-gigabyte files. Unlike loading a file into Python with readlines(), awk never holds the entire file in memory (unless you store every line in an array). This single-pass design is why awk can process a 10 GB log file on a machine with 256 MB of RAM: it reads, processes, and discards each line before reading the next.
War story: A junior engineer wrote a Python script to parse a 50 GB access log and count requests per IP. It took 45 minutes and consumed 12 GB of RAM. The equivalent awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' ran in 3 minutes using 800 KB of memory. The lesson: reach for awk first when the task is field extraction, counting, or aggregation on structured text. Save Python for when you need libraries, complex data structures, or error handling beyond what awk provides.
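The leading-whitespace gotcha is easiest to see side by side. A bracket-expression FS makes every single space a separator, so the empty leading fields survive (FS=" " would not help, since a lone space is the special default):

```shell
# Default FS: leading whitespace is consumed before splitting
echo "  hello world" | awk '{ print "[" $1 "]" }'
# [hello]

# FS as the regex [ ]: each space separates fields, so $1 is empty and NF counts the blanks
echo "  hello world" | awk -F'[ ]' '{ print "[" $1 "]", NF }'
# [] 4
```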
Getline¶
getline reads from input, files, or commands. Use sparingly — it complicates control flow.
# Read next line manually
awk '/START/ { getline; print }' file.txt
# Read from a file
awk 'BEGIN { while ((getline line < "lookup.txt") > 0) map[line]=1 }
$1 in map' data.txt
# Read from a command
awk 'BEGIN { "date" | getline d; print "Report date:", d }'
Conditional Logic and Loops¶
# If-else
awk '{ if ($3 > 90) print $1, "HIGH"; else if ($3 > 50) print $1, "MED"; else print $1, "LOW" }' data.txt
# For loop
awk '{ for (i=1; i<=NF; i++) if ($i ~ /error/) print NR, $i }' logfile.txt
# While loop
awk '{ i=1; while (i<=NF) { print $i; i++ } }' file.txt
# Ternary
awk '{ print $1, ($2 > 0 ? "positive" : "non-positive") }' numbers.txt
Interview tip: "Parse this log file and show the top 10 IP addresses by request count" is a classic awk interview question. The answer: awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10. The pure-awk version: awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' access.log | sort -rn | head -10. Knowing both the piped and pure-awk approaches shows versatility.
awk vs sed — When to Reach for Which¶
| Task | Tool | Why |
|---|---|---|
| Extract specific fields | awk | Built-in field splitting |
| Math on columns | awk | Has arithmetic operators |
| Count/group/aggregate | awk | Associative arrays |
| Multi-column reformatting | awk | printf + field access |
| Simple find-and-replace | sed | One-liner, no field parsing needed |
| Delete/insert lines | sed | Address-based line operations |
| In-place file editing | sed | -i flag |
| Quick regex substitution in a pipe | sed | Lighter syntax for simple cases |
When the task involves fields, numbers, or any form of aggregation, use awk. When the task is substitution or line-level deletion, use sed. When the task is complex, consider writing a proper script.
Timeline: awk (1977) predates Perl (1987), Python (1991), and Ruby (1995). Despite being nearly 50 years old, it ships on every Unix-like system and its core syntax has not changed. The POSIX standard (IEEE 1003.1) specifies awk behavior, so scripts work identically on Linux, macOS, FreeBSD, and Solaris. Learning awk is one of the highest-ROI investments in shell fluency because the skill never expires and the tool is always available.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- Advanced Bash for Ops (Topic Pack, L1) — Bash / Shell Scripting
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Bash / Shell Scripting
- Bash Flashcards (CLI) (flashcard_deck, L1) — Bash / Shell Scripting
- Cron & Job Scheduling (Topic Pack, L1) — Bash / Shell Scripting
- Environment Variables (Topic Pack, L1) — Bash / Shell Scripting
- Fleet Operations at Scale (Topic Pack, L2) — Bash / Shell Scripting
- LPIC / LFCS Exam Preparation (Topic Pack, L2) — Bash / Shell Scripting
- Linux Ops (Topic Pack, L0) — Bash / Shell Scripting
- Linux Ops Drills (Drill, L0) — Bash / Shell Scripting
- Linux Text Processing (Topic Pack, L1) — Bash / Shell Scripting