Scripting Rosetta — Lesson 2: File Operations

Bundle: Bash + Python + CLI Tools
Level: L1–L2 (Foundations → Operations)
Time: 45–60 minutes
Prerequisites: Lesson 1 (text processing), basic terminal comfort


What You'll Learn

By the end of this lesson you'll be able to:

  • Find files by name, size, age, and content in both bash and Python
  • Bulk rename files using shell loops and pathlib
  • Audit file permissions in both languages
  • Know when find/xargs beats Python and vice versa


Part 1: The Mission

You've inherited a messy project directory. Thousands of files, no naming convention, stale logs eating disk, and some files have permissions wide open (0777). Your job:

  1. Find all .log files older than 30 days
  2. Bulk rename *.BACKUP files to *.bak
  3. Find files with dangerous permissions (world-writable)
  4. Locate all files containing a hardcoded password string

Let's solve each one two ways.


Part 2: Find Old Log Files

The Bash Way

find /var/log -name '*.log' -mtime +30 -type f

One command. Let's unpack it:

Piece         | What it does
find /var/log | Start searching from this directory, recursively
-name '*.log' | Match files ending in .log (case-sensitive)
-mtime +30    | Modified more than 30 days ago
-type f       | Regular files only (skip directories, symlinks)

Want to also see the size? Add -ls:

find /var/log -name '*.log' -mtime +30 -type f -ls

Want to delete them? Add -delete — but always preview first:

# Preview what would be deleted
find /var/log -name '*.log' -mtime +30 -type f -print

# Then delete (CAREFUL — no undo)
find /var/log -name '*.log' -mtime +30 -type f -delete

Safety Rule: Never put -delete before your filter predicates. find evaluates its expression left to right, so in find /var/log -delete -name '*.log' the -delete fires on every file before the -name test is ever applied: it deletes EVERYTHING under /var/log. This has destroyed real production systems.

The Python Way

from pathlib import Path
from datetime import datetime, timedelta

cutoff = datetime.now().timestamp() - (30 * 86400)

for p in Path("/var/log").rglob("*.log"):
    if p.is_file() and p.stat().st_mtime < cutoff:
        print(p)

Piece            | What it does
Path("/var/log") | Create a path object rooted at /var/log
.rglob("*.log")  | Recursive glob — like find -name '*.log'
.is_file()       | Skip directories and symlinks
.stat().st_mtime | File's last modification time (epoch seconds)

To delete:

for p in Path("/var/log").rglob("*.log"):
    if p.is_file() and p.stat().st_mtime < cutoff:
        p.unlink()
        print(f"deleted: {p}")

Side-by-Side

                 | Bash (find)                     | Python (pathlib)
Lines of code    | 1                               | 6
Time filter      | -mtime +30 (days, built-in)     | Manual epoch math
Pattern matching | -name, -iname, -regex           | .glob(), .rglob()
Recursion        | Automatic                       | .rglob() or os.walk()
Safety           | -delete is irrecoverable        | .unlink() is irrecoverable
When to pick it  | Quick cleanup from the terminal | Part of a larger script with logging

pathlib vs os.path: Python has two file APIs. os.path is the old way (string-based). pathlib (Python 3.4+) is the modern way (object-oriented). Use pathlib for new code — it's cleaner and composes better.
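A quick side-by-side makes the difference concrete — a minimal sketch, using a made-up path:

```python
import os.path
from pathlib import Path

# Old style: strings in, strings out
legacy_ext = os.path.splitext("/opt/app/report.log")[1]

# New style: a Path object with named attributes
modern_ext = Path("/opt/app/report.log").suffix

print(legacy_ext, modern_ext)  # .log .log

# Paths compose with the / operator instead of os.path.join()
print(Path("/opt/app") / "logs" / "report.log")  # /opt/app/logs/report.log
```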


Part 3: Bulk Rename Files

You have 200 files like report.BACKUP, data.BACKUP, config.BACKUP and you want them renamed to .bak.

The Bash Way

for f in *.BACKUP; do
    mv "$f" "${f%.BACKUP}.bak"
done

The magic is ${f%.BACKUP} — bash parameter expansion:

Syntax          | What it does
${var%pattern}  | Remove shortest match of pattern from the END
${var%%pattern} | Remove longest match from the END
${var#pattern}  | Remove shortest match from the START
${var##pattern} | Remove longest match from the START

So ${f%.BACKUP} strips .BACKUP from the end, then we append .bak.
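A scratch variable shows all four forms at once (archive.tar.gz is just a throwaway example value):

```shell
f="archive.tar.gz"
echo "${f%.gz}"    # archive.tar  (shortest match stripped from the end)
echo "${f%%.*}"    # archive      (longest match stripped from the end)
echo "${f#*.}"     # tar.gz       (shortest match stripped from the start)
echo "${f##*.}"    # gz           (longest match stripped from the start)
```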

For a recursive rename, combine with find:

find . -name '*.BACKUP' -type f -exec bash -c '
    for f; do mv "$f" "${f%.BACKUP}.bak"; done
' _ {} +

Why {} + instead of {} \;?

\; runs one command per file: mv file1, mv file2, mv file3 — 200 fork/execs. + batches files into one command: bash -c '...' _ file1 file2 file3 — much faster. This is the same idea as xargs batching.
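You can watch the difference with a throwaway directory — echo stands in for the real command, so nothing is modified:

```shell
dir=$(mktemp -d)
touch "$dir"/a "$dir"/b "$dir"/c

# \; runs echo once per file: three output lines
find "$dir" -type f -exec echo {} \; | wc -l

# + runs echo once with all files as arguments: one output line
find "$dir" -type f -exec echo {} + | wc -l

rm -rf "$dir"
```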

The Python Way

from pathlib import Path

for p in Path(".").rglob("*.BACKUP"):
    p.rename(p.with_suffix(".bak"))

That's it. Two lines. pathlib makes rename operations trivial:

Method                 | What it does
p.with_suffix(".bak")  | Same path, different extension
p.with_name("new.txt") | Same directory, different filename
p.with_stem("new")     | Same directory and extension, different stem (3.9+)
p.rename(target)       | Move/rename the file
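The first three methods never touch the disk — each returns a new Path object (the path below is hypothetical):

```python
from pathlib import Path

p = Path("/opt/app/report.BACKUP")
print(p.with_suffix(".bak"))    # /opt/app/report.bak
print(p.with_name("data.csv"))  # /opt/app/data.csv
print(p.with_stem("report2"))   # /opt/app/report2.BACKUP  (Python 3.9+)
print(p)                        # unchanged: /opt/app/report.BACKUP
```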

Side-by-Side

            | Bash                          | Python
Simple case | for f in *.X; do mv ...       | for p in Path(".").glob("*.X"): p.rename(...)
Recursive   | find + exec or find + xargs   | .rglob()
Dry run     | echo instead of mv            | print() instead of .rename()
Edge cases  | Must quote "$f" for spaces    | pathlib handles spaces natively
Undo        | None (write a reverse script) | None (write a reverse script)
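A dry run deserves a sketch — a cheap insurance policy for any bulk rename (the DRY_RUN flag is my own convention, not part of pathlib):

```python
from pathlib import Path

DRY_RUN = True  # flip to False once the printed plan looks right

for p in Path(".").rglob("*.BACKUP"):
    target = p.with_suffix(".bak")
    if DRY_RUN:
        print(f"would rename: {p} -> {target}")
    else:
        p.rename(target)
```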

The #1 Bash Rename Mistake: Forgetting quotes.

# WRONG — breaks on filenames with spaces
for f in *.BACKUP; do mv $f ${f%.BACKUP}.bak; done

# RIGHT — always quote variables in bash
for f in *.BACKUP; do mv "$f" "${f%.BACKUP}.bak"; done

In Python, this bug doesn't exist — pathlib treats paths as objects, not strings.


Part 4: Find World-Writable Files

Files with 0777 or any world-writable permission are a security risk.

The Bash Way

find / -type f -perm -o=w -not -path '/proc/*' -not -path '/sys/*' 2>/dev/null

Piece                | What it does
-perm -o=w           | "Others" have write permission (the - means "at least these bits")
-not -path '/proc/*' | Skip virtual filesystems
2>/dev/null          | Suppress "Permission denied" errors

Variants:

# Exact 0777
find / -type f -perm 0777

# SUID bit set (another security risk)
find / -type f -perm -u=s

# Group-writable in a specific directory
find /opt/app -type f -perm -g=w

Permission Bits Refresher:

-rwxrwxrwx = 0777
 |||||||||
 ||||||||└── others: execute
 |||||||└─── others: write     ← this is the dangerous one
 ||||||└──── others: read
 |||||└───── group: execute
 ||||└────── group: write
 |||└─────── group: read
 ||└──────── owner: execute
 |└───────── owner: write
 └────────── owner: read

The Python Way

import stat
from pathlib import Path

for p in Path("/opt/app").rglob("*"):
    if p.is_file():
        mode = p.stat().st_mode
        if mode & stat.S_IWOTH:  # others-write bit
            print(f"{oct(mode)[-3:]}  {p}")

Constant     | Meaning
stat.S_IWOTH | Others write
stat.S_IWGRP | Group write
stat.S_ISUID | SUID bit
stat.S_ISGID | SGID bit
stat.S_ISVTX | Sticky bit
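These constants are plain bit masks, so you can test them against any mode without touching a file (0o4777 is a deliberately bad example mode):

```python
import stat

mode = 0o4777  # SUID + rwxrwxrwx — about as dangerous as a mode gets
print(bool(mode & stat.S_IWOTH))  # True:  world-writable
print(bool(mode & stat.S_ISUID))  # True:  SUID set
print(bool(mode & stat.S_ISVTX))  # False: sticky bit not set
```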

To fix permissions:

import stat
from pathlib import Path

for p in Path("/opt/app").rglob("*"):
    if p.is_file():
        mode = p.stat().st_mode
        if mode & stat.S_IWOTH:
            new_mode = mode & ~stat.S_IWOTH  # clear the bit
            p.chmod(new_mode)
            print(f"fixed: {p} ({oct(mode)[-3:]} -> {oct(new_mode)[-3:]})")

Side-by-Side

            | Bash (find -perm)                    | Python (stat)
Readability | -perm -o=w is concise but cryptic    | stat.S_IWOTH is verbose but self-documenting
Speed       | Fast — find is C, optimized for this | Slower — Python stat() call per file
Fixing      | find ... -exec chmod o-w {} +        | p.chmod(new_mode)
Reporting   | Pipe to tee or xargs ls -la          | Build a dict, generate CSV, send alert

When Bash Wins: Quick audit — "do any world-writable files exist?" One command, done.

When Python Wins: You need to generate a compliance report, fix permissions AND log what changed, or integrate with a ticketing system.


Part 5: Find Files Containing a String

Someone hardcoded password123 in the codebase. Find every file that contains it.

The Bash Way

grep -rl 'password123' /opt/app --include='*.py' --include='*.yaml' --include='*.conf'

Flag      | What it does
-r        | Recursive
-l        | Print only filenames (not the matching lines)
--include | Only search files matching this pattern

For more control, combine find with xargs:

find /opt/app -type f \( -name '*.py' -o -name '*.yaml' \) \
  -exec grep -l 'password123' {} +

grep -r vs find + grep:

grep -r is simpler but follows symlinks and searches binary files by default. find + grep gives you precise control over which files to search. For big codebases, use ripgrep (rg) — it's 10-50x faster and respects .gitignore.
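A self-contained check of the flags, using throwaway files so nothing real is searched (the filenames and contents are invented):

```shell
dir=$(mktemp -d)
printf 'secret = "password123"\n' > "$dir/settings.py"
printf 'name: demo\n'             > "$dir/app.yaml"

# -I skips binary files, -l prints matching filenames only
grep -rIl --include='*.py' --include='*.yaml' 'password123' "$dir"

rm -rf "$dir"
```

Only settings.py is printed — app.yaml is searched but doesn't match.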

The Python Way

from pathlib import Path

extensions = {".py", ".yaml", ".conf"}

for p in Path("/opt/app").rglob("*"):
    if p.is_file() and p.suffix in extensions:
        try:
            text = p.read_text(encoding="utf-8", errors="ignore")
            if "password123" in text:
                print(p)
        except (PermissionError, OSError):
            pass

For line-level detail:

from pathlib import Path

extensions = {".py", ".yaml", ".conf"}

for p in Path("/opt/app").rglob("*"):
    if p.is_file() and p.suffix in extensions:
        try:
            for i, line in enumerate(p.open(encoding="utf-8", errors="ignore"), 1):
                if "password123" in line:
                    print(f"{p}:{i}: {line.rstrip()}")
        except (PermissionError, OSError):
            pass

Side-by-Side

              | Bash (grep)                          | Python
Speed         | Very fast (C implementation)         | Slower (Python I/O loop)
Binary safety | grep -I skips binary files           | Need errors="ignore" or binary check
Regex         | Built-in (grep -E, grep -P)          | import re
Output        | Filename, line number, matching line | Whatever you want
Next step     | Pipe to sed for replacement          | Use str.replace() or re.sub()
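The replacement step looks like this in Python — a minimal sketch on a made-up line of config:

```python
import re

line = 'db_pass = "password123"  # TODO remove'

# Literal replacement
print(line.replace("password123", "***"))

# Regex replacement: redact any double-quoted value after an equals sign
print(re.sub(r'= ".*?"', '= "<redacted>"', line))
```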

Part 6: The File Operations Rosetta Stone

Task             | Bash                           | Python
List files       | ls, find . -maxdepth 1         | Path(".").iterdir()
Recursive list   | find . -type f                 | Path(".").rglob("*")
Check if exists  | [ -f file ], [ -d dir ]        | p.exists(), p.is_file(), p.is_dir()
File size        | stat -c%s file or wc -c < file | p.stat().st_size
File age         | find -mtime, stat -c%Y         | p.stat().st_mtime
Read file        | cat file or < file             | p.read_text()
Write file       | echo "x" > file                | p.write_text("x")
Copy             | cp src dst                     | shutil.copy2(src, dst)
Move/rename      | mv src dst                     | p.rename(dst) or shutil.move()
Delete file      | rm file                        | p.unlink()
Delete directory | rm -rf dir                     | shutil.rmtree(dir)
Create directory | mkdir -p dir                   | p.mkdir(parents=True, exist_ok=True)
Temp file        | mktemp                         | tempfile.NamedTemporaryFile()
Temp directory   | mktemp -d                      | tempfile.TemporaryDirectory()
Permissions      | chmod 644 file                 | p.chmod(0o644)
Owner            | chown user:group file          | os.chown(path, uid, gid)
Symlink          | ln -s target link              | p.symlink_to(target)
Resolve symlink  | readlink -f link               | p.resolve()
Basename         | basename /a/b/c.txt → c.txt    | p.name → c.txt
Extension        | ${f##*.} → txt                 | p.suffix → .txt
Parent dir       | dirname /a/b/c.txt → /a/b      | p.parent → /a/b
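Several rows of the table can be exercised in one self-contained sandbox (the filenames are invented; the temp directory cleans itself up):

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:      # mktemp -d
    src = Path(tmp) / "notes.txt"
    src.write_text("hello")                     # echo "x" > file
    print(src.read_text())                      # cat file
    print(src.stat().st_size)                   # wc -c < file
    shutil.copy2(src, src.with_suffix(".bak"))  # cp src dst
    (Path(tmp) / "a" / "b").mkdir(parents=True, exist_ok=True)  # mkdir -p
    print(sorted(p.name for p in Path(tmp).iterdir()))          # ls
```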

Flashcard Check

Cover the answers and test yourself.

Q1: find . -name '*.log' -mtime +30 — what does +30 mean?

Modified more than 30 days ago. -30 means less than 30 days ago. 30 (no sign) means exactly 30 days (find truncates file age to whole 24-hour periods before comparing).

Q2: What's the difference between find ... {} \; and find ... {} +?

\; runs one command per file (slow). + batches files into fewer commands (fast). It's the same as piping through xargs.

Q3: ${f%.BACKUP} — what does the % do?

Removes the shortest match of .BACKUP from the end of $f. %% removes the longest match. # and ## remove from the start.

Q4: Python Path(".").rglob("*.py") — what does rglob stand for?

Recursive glob. It's equivalent to glob("**/*.py"). Plain glob() is non-recursive (current directory only).

Q5: Why is find / -delete -name '*.log' dangerous?

find evaluates predicates left to right. -delete comes before -name, so it deletes EVERYTHING, then the name filter matches nothing. Always put -delete last.

Q6: stat.S_IWOTH — what permission bit does this represent?

Others (world) write permission: W = write, OTH = others (the S_I prefix is shared by all the stat mode constants). Numeric value: 0o002.

Q7: pathlib.Path.with_suffix(".bak") — does it modify the file?

No. It returns a new Path object with the suffix changed. You must call .rename() to actually rename the file on disk.

Q8: When should you use xargs instead of find -exec?

When you need more control over batching (-n), parallelism (-P), or when piping from a command other than find. For find alone, -exec ... {} + does the same batching as xargs.


Exercises

Exercise 1: The Disk Hog (bash)

Write a one-liner that finds the 10 largest files under /var and shows their size in human-readable format.

Hint 1 `find` can filter by size with `-size`, but for "largest" you need to sort.
Hint 2 Use `find -type f -printf '%s %p\n'` to get size in bytes with the path.
Solution
find /var -type f -printf '%s %p\n' 2>/dev/null | sort -rn | head -10
For human-readable sizes, pipe through `awk` or `numfmt`:
find /var -type f -printf '%s %p\n' 2>/dev/null \
  | sort -rn | head -10 \
  | numfmt --to=iec --field=1

Exercise 2: Port It (bash → python)

Rewrite this bash script as Python using pathlib:

find /opt/app -name '*.tmp' -mtime +7 -type f -exec rm {} +
echo "Cleaned up old temp files"

Your Python version should print each file it deletes.

Hint 1 Use `Path.rglob("*.tmp")` and compare `st_mtime` to a cutoff timestamp.
Hint 2
from datetime import datetime, timedelta
cutoff = (datetime.now() - timedelta(days=7)).timestamp()
Solution
from pathlib import Path
from datetime import datetime, timedelta

cutoff = (datetime.now() - timedelta(days=7)).timestamp()

count = 0
for p in Path("/opt/app").rglob("*.tmp"):
    if p.is_file() and p.stat().st_mtime < cutoff:
        print(f"deleting: {p}")
        p.unlink()
        count += 1

print(f"Cleaned up {count} old temp files")

Exercise 3: Port It (python → bash)

Rewrite this Python script as a bash one-liner or short script:

from pathlib import Path
from collections import Counter

extensions = Counter(
    p.suffix.lower()
    for p in Path(".").rglob("*")
    if p.is_file() and p.suffix
)

for ext, count in extensions.most_common(10):
    print(f"{count:>6}  {ext}")

(It counts file extensions in the current directory tree.)

Hint 1 Extract the extension with bash parameter expansion or `awk`.
Hint 2 `find . -type f -name '*.*'` gets files with extensions. Then extract the extension.
Solution
find . -type f -name '*.*' | awk -F. '{print "." tolower($NF)}' \
  | sort | uniq -c | sort -rn | head -10
Or using `sed`:
find . -type f -name '*.*' \
  | sed 's/.*\./\./' \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn | head -10

Exercise 4: The Permission Audit

Write a script (bash or Python — your choice) that:

  1. Scans a directory recursively
  2. Finds all files with permissions more permissive than 0644
  3. Outputs a report: filename, current permissions, recommended permissions

Hint A file is "more permissive than 0644" if any of these bits are set: group write, others write, or any execute bit (for non-scripts). Use `stat` to get the mode.
Solution (bash)
find /opt/app -type f ! -perm 0644 ! -perm 0755 -printf '%m %p\n' \
  | while read -r mode file; do
      echo "$file: current=$mode recommended=644"
    done
(Use `read -r` so backslashes in filenames aren't mangled.)
Solution (Python)
import stat
from pathlib import Path

for p in Path("/opt/app").rglob("*"):
    if not p.is_file():
        continue
    mode = p.stat().st_mode & 0o777
    if mode not in (0o644, 0o755):
        print(f"{p}: current={oct(mode)[2:]} recommended=644")

Exercise 5: The Decision (think, don't code)

For each task, decide: bash or Python? Justify your choice.

  1. Find all .env files in a project tree
  2. Reorganize 5,000 photos into YYYY/MM/DD/ directories based on EXIF data
  3. Check if a config file exists before starting a service
  4. Compare two directory trees and report differences
  5. Create a ZIP archive of files modified in the last 24 hours
Answers

  1. Bash. `find . -name '.env' -type f` — one command, done.
  2. Python. EXIF parsing requires a library (`Pillow` or `exifread`), date logic, directory creation, and error handling for corrupt files. Bash would need `exiftool` and complex string manipulation.
  3. Bash. `[ -f /etc/app/config.yaml ] && systemctl start app` — this is what shell scripting was made for.
  4. Either. Bash: `diff -rq dir1/ dir2/` is built-in and fast. Python: `filecmp.dircmp()` gives programmatic access. Pick based on whether you need the result as data.
  5. Bash. `find . -mtime -1 -type f -print0 | xargs -0 zip archive.zip` — one pipeline. Python works too (`zipfile` module) but it's more code for no benefit.

Key Takeaways

  1. find is a Swiss Army knife. It filters by name, size, age, type, permissions, and owner — all in one command. Learn its predicates cold.

  2. pathlib is Python's modern file API. Use it instead of os.path for new code. It's cleaner, composes better, and handles edge cases (spaces, unicode) natively.

  3. Quoting in bash is non-negotiable. Always quote "$variable" — unquoted variables break on spaces, globbing characters, and empty strings.

  4. find -exec {} + batches like xargs. Use + instead of \; for performance. Only use \; when you need exactly one invocation per file.

  5. Permission audits should be scripted, not manual. Whether you use find -perm or Python's stat, automate it so it runs on every deploy or as a cron job.


What's Next

  • Lesson 3: Process Management — &/wait/trap vs subprocess/asyncio/signal
  • Lesson 4: Data Wrangling — jq/cut/sort vs json/csv/pandas one-liners
  • Lesson 5: Error Handling — set -euo pipefail vs try/except, retry patterns in both