Portal | Level: L1: Foundations | Topics: awk, Bash / Shell Scripting | Domain: CLI Tools

awk — The Record/Field Processor Primer

Why This Matters

awk is the Unix power tool for structured text. Where sed thinks in lines and substitutions, awk thinks in records and fields — it sees every line as a data row and every word as a column. It has variables, arrays, arithmetic, control flow, and printf formatting. It is a small programming language disguised as a command-line text processor.

Every log file, CSV, TSV, command output, and config file you encounter in operations is structured text that awk can parse, filter, aggregate, and reshape in a single pass. If you cannot reach for awk when you need to sum a column, count occurrences, or reformat fields, you are writing throwaway scripts for problems awk solves in one line.


Fun fact: awk was created in 1977 by Alfred Aho, Peter Weinberger, and Brian Kernighan at Bell Labs — the name is their initials. It was one of the first data-driven programming languages. The gawk (GNU awk) variant is the default on most Linux distributions and adds features like network I/O and multidimensional arrays. mawk is a faster but less feature-rich alternative often found in Debian/Ubuntu.

Remember: awk's key advantage over grep and sed is that it understands columns. While grep finds lines and sed transforms text, awk splits each line into fields ($1, $2, ..., $NF) and lets you compute, filter, and aggregate by field. The command awk '{sum+=$3} END {print sum}' sums the third column of an entire file — try doing that with grep or sed.
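A quick way to see this in action, using inline sample data (the names and values here are invented for illustration):

```shell
# Three rows of "name category value"; sum the third column
printf 'alice web 120\nbob api 45\ncarol web 300\n' |
  awk '{ sum += $3 } END { print sum }'
# 465
```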

Mental Model

awk sees every line as a record and every whitespace-separated word as a field. It processes input record by record, applying pattern-action rules to each one.

The Pattern-Action Structure

Every awk program is a set of pattern { action } rules:

# Print every line (default action is print)
awk '{ print }' file.txt

# Print lines matching a pattern
awk '/ERROR/ { print }' file.txt

# Print lines where field 3 is greater than 100
awk '$3 > 100 { print }' file.txt

# Print field 1 and field 3 from every line
awk '{ print $1, $3 }' file.txt

Built-In Variables

Variable     Meaning
$0           Entire current record (line)
$1, $2, ...  Fields 1, 2, etc.
NR           Number of records read so far (line number)
NF           Number of fields in the current record
FS           Input field separator (default: whitespace)
OFS          Output field separator (default: space)
RS           Input record separator (default: newline)
ORS          Output record separator (default: newline)
FILENAME     Name of the current input file
FNR          Record number within the current file (resets at each new file)

# Print line number and line
awk '{ print NR, $0 }' file.txt

# Print last field of every line
awk '{ print $NF }' file.txt

# Print lines with more than 5 fields
awk 'NF > 5' file.txt

# Print second-to-last field
awk '{ print $(NF-1) }' file.txt
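NR and FNR are easy to confuse. A small two-file sketch (the throwaway files are created only for illustration) shows NR counting across all input files while FNR resets per file:

```shell
# Create two throwaway input files
printf 'a\nb\n' > /tmp/awk_demo_f1.txt
printf 'c\n'    > /tmp/awk_demo_f2.txt

# NR runs 1,2,3 across both files; FNR restarts at 1 in the second
awk '{ print FILENAME, NR, FNR }' /tmp/awk_demo_f1.txt /tmp/awk_demo_f2.txt
# /tmp/awk_demo_f1.txt 1 1
# /tmp/awk_demo_f1.txt 2 2
# /tmp/awk_demo_f2.txt 3 1
```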

Field Separator

# Colon-separated (like /etc/passwd)
awk -F: '{ print $1, $3 }' /etc/passwd

# CSV (naive — does not handle quoted commas)
awk -F, '{ print $2 }' data.csv

# Multiple characters as separator
awk -F'::' '{ print $1 }' file.txt

# Set FS in BEGIN block
awk 'BEGIN { FS=":" } { print $1 }' /etc/passwd

# Tab-separated output
awk -F: 'BEGIN { OFS="\t" } { print $1, $3, $7 }' /etc/passwd

BEGIN and END Blocks

# Print a header, process lines, print a footer
awk -F: 'BEGIN { OFS="\t"; print "User", "UID", "Shell" }
     { print $1, $3, $7 }
     END { print "---\nTotal:", NR, "users" }' /etc/passwd

# Sum a column
awk '{ sum += $3 } END { print "Total:", sum }' data.txt

# Count lines matching a pattern
awk '/ERROR/ { count++ } END { print count, "errors" }' logfile.txt

One-liner: The awk '!seen[$0]++' deduplication idiom is one of the most elegant awk patterns. It works because: 1) seen[$0] is initially 0 (false) for unseen lines, 2) the ! negates it to true (so awk prints the line), 3) ++ increments the count to 1 (so next time !seen[$0] is false and the duplicate is skipped). All in 11 characters.
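The idiom is easiest to believe after watching it run on a small input with duplicates:

```shell
# Duplicates are dropped; first occurrences keep their original order
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# a
# b
# c
```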

Printf Formatting

# Formatted output (like C printf)
awk -F: '{ printf "%-20s %5d %s\n", $1, $3, $7 }' /etc/passwd

# Right-justified numbers
awk '{ printf "%10.2f\n", $1 }' numbers.txt

# Output with padding
awk '{ printf "%05d %s\n", NR, $0 }' file.txt

Associative Arrays

awk arrays are associative (like dictionaries/hashmaps). Indices can be any string.

# Count occurrences of each value in column 1
awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' access.log

# Sum values grouped by a key
awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' sales.txt

# Deduplicate lines (preserving first occurrence)
awk '!seen[$0]++' file.txt

# Find most frequent value
awk '{ count[$1]++ }
     END { for (k in count) if (count[k] > max) { max=count[k]; val=k }
           print val, max }' data.txt

String Functions

# length — string length
awk '{ print length($0) }' file.txt

# substr — extract substring
awk '{ print substr($0, 1, 10) }' file.txt

# index — find position of substring
awk '{ pos = index($0, "error"); if (pos) print NR, pos }' file.txt

# split — split string into array
awk '{ n = split($1, parts, "."); print parts[1], parts[n] }' hostnames.txt

# sub — replace first match
awk '{ sub(/http:/, "https:"); print }' urls.txt

# gsub — replace all matches
awk '{ gsub(/[[:space:]]+/, " "); print }' messy.txt

# match — regex match with position/length
awk '{ if (match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/))
         print substr($0, RSTART, RLENGTH) }' logfile.txt

# tolower / toupper
awk '{ print toupper($1) }' file.txt

Math Operations

# Average
awk '{ sum += $1; count++ } END { print sum/count }' numbers.txt

# Min and max
awk 'NR==1 { min=$1; max=$1 }
     $1 < min { min=$1 }
     $1 > max { max=$1 }
     END { print "min:", min, "max:", max }' numbers.txt

# Percentages (store keys and values, then compute in END once the total is known)
awk '{ keys[NR]=$1; vals[NR]=$2; total += $2 }
     END { for (i=1; i<=NR; i++) printf "%s %.1f%%\n", keys[i], (vals[i]/total)*100 }' file.txt

# Running total
awk '{ sum += $1; print $0, sum }' numbers.txt

# Random sampling (approximately 10% of lines)
awk 'BEGIN { srand() } rand() < 0.1' bigfile.txt

Gotcha: awk's default field separator is not just space — it is "one or more whitespace characters" (spaces and tabs). This means leading whitespace is consumed: echo " hello world" | awk '{print $1}' prints hello, not an empty string. Beware that setting FS to a plain single space (-F' ') does not change this — a single-space FS is awk's special notation for the default behavior. To split on every individual space, use a bracket expression such as -F'[ ]'.
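The difference is visible with a line that has leading spaces — the default consumes them, while the bracket-expression form -F'[ ]' forces literal single-space splitting:

```shell
# Default FS: leading whitespace is consumed, so $1 is "hello"
echo "  hello world" | awk '{ print "[" $1 "]" }'
# [hello]

# FS='[ ]': every single space separates fields, so the two
# leading spaces create two empty leading fields
echo "  hello world" | awk -F'[ ]' '{ print "[" $1 "]", NF }'
# [] 4
```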

Debug clue: When an awk script produces unexpected output on a field, use awk '{print NF, "|" $0 "|"}' to inspect the actual number of fields and the raw line. Invisible characters (carriage returns from Windows files, non-breaking spaces from web copy-paste) silently corrupt field splitting. Run cat -A file.txt to reveal hidden characters — ^M at line ends means Windows \r\n line endings.
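A minimal reproduction of the Windows line-ending problem (the sample line is invented for illustration):

```shell
# The \r is not a field separator, so it sticks to the last field:
# $2 is "200\r" (length 4), which silently breaks tests like $2 == "200"
printf 'user1 200\r\n' | awk '{ print length($2) }'
# 4

# Strip carriage returns before processing
printf 'user1 200\r\n' | tr -d '\r' | awk '{ print length($2) }'
# 3
```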

Under the hood: awk processes input in a single pass, reading one record at a time, which makes it memory-efficient even on multi-gigabyte files. Unlike loading a file into Python with readlines(), awk never holds the entire file in memory (unless you store every line in an array). This single-pass design is why awk can process a 10 GB log file on a machine with 256 MB of RAM — it reads, processes, and discards each line before reading the next.

War story: A junior engineer wrote a Python script to parse a 50 GB access log and count requests per IP. It took 45 minutes and consumed 12 GB of RAM. The equivalent awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' ran in 3 minutes using 800 KB of memory. The lesson: reach for awk first when the task is field extraction, counting, or aggregation on structured text. Save Python for when you need libraries, complex data structures, or error handling beyond what awk provides.

Getline

getline reads from input, files, or commands. Use sparingly — it complicates control flow.

# Read next line manually
awk '/START/ { getline; print }' file.txt

# Read from a file
awk 'BEGIN { while ((getline line < "lookup.txt") > 0) map[line]=1 }
     $1 in map' data.txt

# Read from a command
awk 'BEGIN { "date" | getline d; print "Report date:", d }'

Conditional Logic and Loops

# If-else
awk '{ if ($3 > 90) print $1, "HIGH"; else if ($3 > 50) print $1, "MED"; else print $1, "LOW" }' data.txt

# For loop
awk '{ for (i=1; i<=NF; i++) if ($i ~ /error/) print NR, $i }' logfile.txt

# While loop
awk '{ i=1; while (i<=NF) { print $i; i++ } }' file.txt

# Ternary
awk '{ print $1, ($2 > 0 ? "positive" : "non-positive") }' numbers.txt

Interview tip: "Parse this log file and show the top 10 IP addresses by request count" is a classic awk interview question. The answer: awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10. The pure-awk version: awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' access.log | sort -rn | head -10. Knowing both the piped and pure-awk approaches shows versatility.

awk vs sed — When to Reach for Which

Task                                Tool  Why
Extract specific fields             awk   Built-in field splitting
Math on columns                     awk   Has arithmetic operators
Count/group/aggregate               awk   Associative arrays
Multi-column reformatting           awk   printf + field access
Simple find-and-replace             sed   One-liner, no field parsing needed
Delete/insert lines                 sed   Address-based line operations
In-place file editing               sed   -i flag
Quick regex substitution in a pipe  sed   Lighter syntax for simple cases

When the task involves fields, numbers, or any form of aggregation, use awk. When the task is substitution or line-level deletion, use sed. When the task is complex, consider writing a proper script.
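The split is clearest when both tools touch the same file (the sample data below is invented for illustration):

```shell
# Sample colon-delimited data
printf 'alice:x:1001\nbob:x:1002\n' > /tmp/awk_demo_users.txt

# Field work -> awk: pull out column 3
awk -F: '{ print $3 }' /tmp/awk_demo_users.txt
# 1001
# 1002

# Pure substitution -> sed: mask the second field
sed 's/:x:/:*:/' /tmp/awk_demo_users.txt
# alice:*:1001
# bob:*:1002
```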

Timeline: awk (1977) predates Perl (1987), Python (1991), and Ruby (1995). Despite being nearly 50 years old, it ships on every Unix-like system and its core syntax has not changed. The POSIX standard (IEEE 1003.1) specifies awk behavior, so scripts work identically on Linux, macOS, FreeBSD, and Solaris. Learning awk is one of the highest-ROI investments in shell fluency because the skill never expires and the tool is always available.
