Scripting Rosetta - Lesson 2: File Operations
Bundle: Bash + Python + CLI Tools
Level: L1–L2 (Foundations → Operations)
Time: 45–60 minutes
Prerequisites: Lesson 1 (text processing), basic terminal comfort
What You'll Learn
By the end of this lesson you'll be able to:
- Find files by name, size, age, and content in both bash and Python
- Bulk rename files using shell loops and pathlib
- Audit file permissions in both languages
- Know when find/xargs beats Python and vice versa
Part 1: The Mission
You've inherited a messy project directory. Thousands of files, no naming convention, stale logs eating disk, and some files have permissions wide open (0777). Your job:
- Find all `.log` files older than 30 days
- Bulk rename `*.BACKUP` files to `*.bak`
- Find files with dangerous permissions (world-writable)
- Locate all files containing a hardcoded password string
Let's solve each one two ways.
Part 2: Find Old Log Files
The Bash Way
One command: `find /var/log -name '*.log' -mtime +30 -type f`. Let's unpack it:
| Piece | What it does |
|---|---|
| `find /var/log` | Start searching from this directory, recursively |
| `-name '*.log'` | Match files ending in .log (case-sensitive) |
| `-mtime +30` | Modified more than 30 days ago |
| `-type f` | Regular files only (skip directories, symlinks) |
Want to also see the size? Add `-ls` as the final predicate: `find /var/log -name '*.log' -mtime +30 -type f -ls`.
Want to delete them? Add `-delete`, but always preview first:
```bash
# Preview what would be deleted
find /var/log -name '*.log' -mtime +30 -type f -print

# Then delete (CAREFUL: no undo)
find /var/log -name '*.log' -mtime +30 -type f -delete
```
Safety Rule: Never put `-delete` before your filter predicates. `find /var/log -delete -name '*.log'` deletes EVERYTHING first, then filters nothing. `find` evaluates left to right. This has destroyed real production systems.
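To see why predicate order matters, the toy model below treats a `find` expression as a list of tests joined by an implicit AND and evaluated left to right for each file. It is only an illustration, not how `find` is implemented: `make_delete`, `name_is_log`, and `run` are hypothetical names, and the pretend delete action records names instead of removing anything.

```python
from typing import Callable, List

# Toy model (an assumption for illustration): an expression is a list of
# predicates joined by an implicit AND, evaluated left to right per file.
Predicate = Callable[[str], bool]


def make_delete(deleted: List[str]) -> Predicate:
    """Return a pretend -delete action that records names instead of removing them."""
    def delete(name: str) -> bool:
        deleted.append(name)
        return True  # the action reports success, so evaluation continues
    return delete


def name_is_log(name: str) -> bool:
    """Stand-in for -name '*.log'."""
    return name.endswith(".log")


def run(expression: List[Predicate], files: List[str]) -> None:
    for name in files:
        # all() stops at the first False, mirroring a short-circuit AND
        all(pred(name) for pred in expression)


files = ["app.log", "passwd", "notes.txt"]

safe: List[str] = []
run([name_is_log, make_delete(safe)], files)    # like: -name '*.log' -delete

unsafe: List[str] = []
run([make_delete(unsafe), name_is_log], files)  # like: -delete -name '*.log'

print("filter first:", safe)
print("delete first:", unsafe)
```

With the filter first, only `app.log` reaches the delete action. With the action first, every name reaches it before the name test is consulted.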
The Python Way
```python
from pathlib import Path
from datetime import datetime

# Epoch timestamp for "30 days ago"; older files have a smaller st_mtime
cutoff = datetime.now().timestamp() - (30 * 86400)

for p in Path("/var/log").rglob("*.log"):
    if p.is_file() and p.stat().st_mtime < cutoff:
        print(p)
```
| Piece | What it does |
|---|---|
| `Path("/var/log")` | Create a path object rooted at /var/log |
| `.rglob("*.log")` | Recursive glob, like `find -name '*.log'` |
| `.is_file()` | Skip directories (symlinks are followed, so a link to a regular file still passes) |
| `.stat().st_mtime` | File's last modification time (epoch seconds) |
To delete:
```python
# Reuses cutoff from the snippet above
for p in Path("/var/log").rglob("*.log"):
    if p.is_file() and p.stat().st_mtime < cutoff:
        p.unlink()
        print(f"deleted: {p}")
```
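The "preview first" habit from the bash section carries over to Python. The sketch below is one way to build it in, assuming hypothetical helpers named `old_logs` and `clean` with a `dry_run` flag that defaults to the safe setting. The demo runs in a throwaway temporary directory and backdates one file with `os.utime` so nothing real is touched.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List


def old_logs(root: Path, days: int) -> List[Path]:
    """Return *.log files under root last modified more than `days` days ago."""
    cutoff = time.time() - days * 86400
    return sorted(
        p for p in root.rglob("*.log")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def clean(root: Path, days: int = 30, dry_run: bool = True) -> List[Path]:
    """Delete old logs, or only report them when dry_run is True (the default)."""
    targets = old_logs(root, days)
    for p in targets:
        if dry_run:
            print(f"would delete: {p}")
        else:
            p.unlink()
            print(f"deleted: {p}")
    return targets


# Demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    stale, fresh = root / "stale.log", root / "fresh.log"
    stale.write_text("old")
    fresh.write_text("new")
    forty_days_ago = time.time() - 40 * 86400
    os.utime(stale, (forty_days_ago, forty_days_ago))  # backdate the mtime

    previewed = clean(root)                 # preview only
    still_there = stale.exists()
    removed = clean(root, dry_run=False)    # real deletion
    stale_gone = not stale.exists()
    fresh_kept = fresh.exists()
```

Making the dry run the default means a forgotten argument prints a list instead of deleting files.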
Side-by-Side
| | Bash (find) | Python (pathlib) |
|---|---|---|
| Lines of code | 1 | 6 |
| Time filter | `-mtime +30` (days, built-in) | Manual epoch math |
| Pattern matching | `-name`, `-iname`, `-regex` | `.glob()`, `.rglob()` |
| Recursion | Automatic | `.rglob()` or `os.walk()` |
| Safety | `-delete` is irrecoverable | `.unlink()` is irrecoverable |
| When to pick it | Quick cleanup from the terminal | Part of a larger script with logging |
`pathlib` vs `os.path`: Python has two file APIs. `os.path` is the old way (string-based). `pathlib` (Python 3.4+) is the modern way (object-oriented). Use `pathlib` for new code: it's cleaner and composes better.
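A small comparison may make the difference concrete. The snippet below derives a sibling `.bak` path both ways. The path `/var/log/app/server.log` is a made-up example and is never opened.

```python
import os.path
from pathlib import Path

raw = "/var/log/app/server.log"  # hypothetical path, used only as a string

# os.path: plain strings passed through module-level functions
old_dir = os.path.dirname(raw)
old_stem, old_ext = os.path.splitext(os.path.basename(raw))
old_sibling = os.path.join(old_dir, old_stem + ".bak")

# pathlib: one object whose attributes and methods chain together
p = Path(raw)
new_sibling = p.with_suffix(".bak")

print(old_sibling)
print(new_sibling)
```

Both approaches arrive at the same path. The `pathlib` version needs one call where the string version needs three.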
Part 3: Bulk Rename Files
You have 200 files like report.BACKUP, data.BACKUP, config.BACKUP and you want
them renamed to .bak.
The Bash Way
A short loop does it: `for f in *.BACKUP; do mv "$f" "${f%.BACKUP}.bak"; done`. The magic is `${f%.BACKUP}`, bash parameter expansion:
| Syntax | What it does |
|---|---|
| `${var%pattern}` | Remove shortest match of pattern from the END |
| `${var%%pattern}` | Remove longest match from the END |
| `${var#pattern}` | Remove shortest match from the START |
| `${var##pattern}` | Remove longest match from the START |
So ${f%.BACKUP} strips .BACKUP from the end, then we append .bak.
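For readers coming from Python, the four expansions can be approximated with `str.partition` and `str.rpartition` when the pattern is "everything up to a dot". The sketch below assumes the name contains at least one dot. Without one, bash leaves the value unchanged, while these calls return an empty string for some cases.

```python
f = "archive.tar.gz"

shortest_from_end = f.rpartition(".")[0]    # like ${f%.*}
longest_from_end = f.partition(".")[0]      # like ${f%%.*}
shortest_from_start = f.partition(".")[2]   # like ${f#*.}
longest_from_start = f.rpartition(".")[2]   # like ${f##*.}

# The lesson's case: strip a literal suffix, then append the new one
name = "report.BACKUP"
suffix = ".BACKUP"
renamed = name[: -len(suffix)] + ".bak" if name.endswith(suffix) else name

print(shortest_from_end, longest_from_end, shortest_from_start, longest_from_start)
print(renamed)
```

"Shortest from the end" removes only the last extension, and "longest from the end" removes everything after the first dot.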
For a recursive rename, combine with find:
Why `{} +` instead of `{} \;`?
`\;` runs one command per file: `mv file1`, `mv file2`, `mv file3`, which means 200 fork/execs. `+` batches files into one command: `bash -c '...' _ file1 file2 file3`, which is much faster. This is the same idea as `xargs` batching.
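The cost difference comes down to how many processes get started. The sketch below only models the argument lists, it does not run anything: `one_per_file` and `batched` are hypothetical helpers, `cmd` is a placeholder command name, and `max_args` is a simplified stand-in for the operating system's real limit, which is based on total argument size rather than a count.

```python
from typing import Iterator, List


def one_per_file(files: List[str]) -> List[List[str]]:
    """Model `-exec cmd {} \\;`: one argv (one process) per file."""
    return [["cmd", f] for f in files]


def batched(files: List[str], max_args: int = 5000) -> Iterator[List[str]]:
    """Model `-exec cmd {} +`: pack as many files as fit into each argv."""
    for start in range(0, len(files), max_args):
        yield ["cmd", *files[start:start + max_args]]


files = [f"file{i}.BACKUP" for i in range(200)]

slow = one_per_file(files)
fast = list(batched(files))
small_batches = list(batched(files, max_args=64))

print(len(slow), "processes with \\;")
print(len(fast), "process with +")
print(len(small_batches), "processes when only 64 arguments fit")
```

Even when the files do not all fit into one invocation, batching still cuts the process count sharply.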
The Python Way
Two lines do it: `for p in Path(".").glob("*.BACKUP"):` followed by `p.rename(p.with_suffix(".bak"))`. pathlib makes rename operations trivial:
| Method | What it does |
|---|---|
| `p.with_suffix(".bak")` | Same path, different extension |
| `p.with_name("new.txt")` | Same directory, different filename |
| `p.with_stem("new")` | Same directory and extension, different stem (3.9+) |
| `p.rename(target)` | Move/rename the file |
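Before renaming hundreds of files it can help to preview the plan and to refuse to overwrite targets that already exist. The sketch below is one possible shape for that, built from the methods in the table. `plan_renames` and `apply_renames` are hypothetical helper names, and the demo works in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def plan_renames(root: Path, old: str = ".BACKUP", new: str = ".bak") -> List[Tuple[Path, Path]]:
    """Pair each file ending in `old` with its target name; nothing is renamed yet."""
    return [(p, p.with_suffix(new)) for p in sorted(root.glob(f"*{old}")) if p.is_file()]


def apply_renames(plan: List[Tuple[Path, Path]], dry_run: bool = True) -> int:
    """Rename each pair, skipping targets that already exist. Returns the count renamed."""
    done = 0
    for src, dst in plan:
        if dst.exists():
            print(f"skip (target exists): {dst.name}")
        elif dry_run:
            print(f"would rename: {src.name} -> {dst.name}")
        else:
            src.rename(dst)
            done += 1
    return done


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("report.BACKUP", "my data.BACKUP", "config.BACKUP"):
        (root / name).write_text("x")
    (root / "config.bak").write_text("already here")  # forces one collision

    plan = plan_renames(root)
    previewed = apply_renames(plan)               # dry run: prints, renames nothing
    renamed = apply_renames(plan, dry_run=False)
    remaining = sorted(p.name for p in root.iterdir())
```

The file with a space in its name needs no special handling, and `config.BACKUP` is left alone because `config.bak` already exists.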
Side-by-Side
| | Bash | Python |
|---|---|---|
| Simple case | `for f in *.X; do mv ...` | `for p in Path(".").glob("*.X"): p.rename(...)` |
| Recursive | `find` + `-exec` or `find` + `xargs` | `.rglob()` |
| Dry run | `echo` instead of `mv` | `print()` instead of `.rename()` |
| Edge cases | Must quote `"$f"` for spaces | `pathlib` handles spaces natively |
| Undo | None (write a reverse script) | None (write a reverse script) |
The #1 Bash Rename Mistake: Forgetting quotes.
```bash
# WRONG: breaks on filenames with spaces
for f in *.BACKUP; do mv $f ${f%.BACKUP}.bak; done

# RIGHT: always quote variables in bash
for f in *.BACKUP; do mv "$f" "${f%.BACKUP}.bak"; done
```
In Python, this bug doesn't exist: `pathlib` treats paths as objects, not strings.
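The effect of missing quotes can be seen without touching any files. The sketch below uses Python's `shlex` module as an approximation of POSIX shell tokenization. Bash actually splits unquoted variables after expansion, but for names containing spaces the resulting argument list is the same. The filenames are made up for the demonstration.

```python
import shlex

filename = "my report.BACKUP"
target = "my report.bak"

# Unquoted: whitespace splits each name, so mv would receive four path arguments
unquoted = shlex.split(f"mv {filename} {target}")

# Quoted: each name survives as a single argument
quoted = shlex.split(f'mv "{filename}" "{target}"')

# shlex.quote builds a safely quoted word when a command line must be assembled as text
assembled = f"mv {shlex.quote(filename)} {shlex.quote(target)}"

print(unquoted)
print(quoted)
print(assembled)
```

When a Python script has to call an external command, passing an argument list to `subprocess.run` avoids building a command string at all.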
Part 4: Find World-Writable Files
Files with 0777 or any world-writable permission are a security risk.
The Bash Way
| Piece | What it does |
|---|---|
| `-perm -o=w` | "Others" have write permission (the `-` means "at least these bits") |
| `-not -path '/proc/*'` | Skip virtual filesystems |
| `2>/dev/null` | Suppress "Permission denied" errors |
Variants:
```bash
# Exact 0777
find / -type f -perm 0777

# SUID bit set (another security risk)
find / -type f -perm -u=s

# Group-writable in a specific directory
find /opt/app -type f -perm -g=w
```
Permission Bits Refresher:
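One way to refresh the layout is to decode a mode in Python. The sketch below takes the value `0o100777` (a regular file with every permission bit set), prints its symbolic form with `stat.filemode`, and breaks each octal digit into its read (4), write (2), and execute (1) parts. The variable names are illustrative only.

```python
import stat

# 0o100000 marks a regular file; the low nine bits are rwx for user, group, others
mode = 0o100777

print(stat.filemode(mode))        # symbolic form, in the style of ls -l
print(oct(stat.S_IMODE(mode)))    # permission bits only

# Each octal digit is r=4, w=2, x=1 added together
for who, shift in (("user", 6), ("group", 3), ("others", 0)):
    digit = (mode >> shift) & 0o7
    flags = "".join(c if digit & bit else "-" for c, bit in (("r", 4), ("w", 2), ("x", 1)))
    print(f"{who:<6} {digit} {flags}")

world_writable = bool(mode & stat.S_IWOTH)  # the bit that -perm -o=w tests
safe_mode = 0o100644
```

A mode of 0644 decodes to read and write for the owner and read-only for group and others, so its others-write bit is clear.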
The Python Way
```python
import stat
from pathlib import Path

for p in Path("/opt/app").rglob("*"):
    if p.is_file():
        mode = p.stat().st_mode
        if mode & stat.S_IWOTH:  # others-write bit
            print(f"{oct(mode)[-3:]} {p}")
```
| Constant | Meaning |
|---|---|
| `stat.S_IWOTH` | Others write |
| `stat.S_IWGRP` | Group write |
| `stat.S_ISUID` | SUID bit |
| `stat.S_ISGID` | SGID bit |
| `stat.S_ISVTX` | Sticky bit |
To fix permissions:
```python
import stat
from pathlib import Path

for p in Path("/opt/app").rglob("*"):
    if p.is_file():
        mode = p.stat().st_mode
        if mode & stat.S_IWOTH:
            new_mode = mode & ~stat.S_IWOTH  # clear the bit
            p.chmod(new_mode)
            print(f"fixed: {p} ({oct(mode)[-3:]} → {oct(new_mode)[-3:]})")
```
Side-by-Side
| | Bash (find -perm) | Python (stat) |
|---|---|---|
| Readability | `-perm -o=w` is concise but cryptic | `stat.S_IWOTH` is verbose but self-documenting |
| Speed | Fast: `find` is C, optimized for this | Slower: Python `stat()` call per file |
| Fixing | `find ... -exec chmod o-w {} +` | `p.chmod(new_mode)` |
| Reporting | Pipe to `tee` or `xargs ls -la` | Build a dict, generate CSV, send alert |
When Bash Wins: Quick audit — "do any world-writable files exist?" One command, done.
When Python Wins: You need to generate a compliance report, fix permissions AND log what changed, or integrate with a ticketing system.
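As an illustration of the reporting case, the sketch below collects world-writable files into rows and renders them as CSV. The column names (`path`, `current`, `proposed`) and the helpers `audit` and `to_csv` are assumptions made for this example, not a standard report format. The demo assumes a POSIX filesystem and runs in a temporary directory.

```python
import csv
import io
import os
import stat
import tempfile
from pathlib import Path
from typing import Dict, List


def audit(root: Path) -> List[Dict[str, str]]:
    """Collect one row per world-writable regular file under root."""
    rows = []
    for p in sorted(root.rglob("*")):
        if p.is_file():
            mode = stat.S_IMODE(p.stat().st_mode)
            if mode & stat.S_IWOTH:
                rows.append({
                    "path": str(p.relative_to(root)),
                    "current": f"{mode:04o}",
                    "proposed": f"{mode & ~stat.S_IWOTH:04o}",
                })
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["path", "current", "proposed"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "open.conf").write_text("x")
    (root / "ok.conf").write_text("x")
    os.chmod(root / "open.conf", 0o666)  # world-writable
    os.chmod(root / "ok.conf", 0o644)
    report = audit(root)
    csv_text = to_csv(report)
    print(csv_text, end="")
```

Keeping the audit separate from the rendering makes it easy to send the same rows to a log, a ticket, or a fix-up step.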
Part 5: Find Files Containing a String
Someone hardcoded password123 in the codebase. Find every file that contains it.
The Bash Way
| Flag | What it does |
|---|---|
| `-r` | Recursive |
| `-l` | Print only filenames (not the matching lines) |
| `--include` | Only search files matching this pattern |
For more control, combine find with xargs:
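`grep -r` vs `find` + `grep`:
`grep -r` is simpler, but it searches binary files by default (add `-I` to skip them). With GNU grep, `-r` follows only the symlinks named on the command line, while `-R` follows every symlink it meets. `find` + `grep` gives you precise control over which files to search. For big codebases, use `ripgrep` (`rg`): it is usually much faster and respects `.gitignore`.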
The Python Way
```python
from pathlib import Path

extensions = {".py", ".yaml", ".conf"}

for p in Path("/opt/app").rglob("*"):
    if p.is_file() and p.suffix in extensions:
        try:
            text = p.read_text(encoding="utf-8", errors="ignore")
            if "password123" in text:
                print(p)
        except (PermissionError, OSError):
            pass
```
For line-level detail:
```python
from pathlib import Path

extensions = {".py", ".yaml", ".conf"}

for p in Path("/opt/app").rglob("*"):
    if p.is_file() and p.suffix in extensions:
        try:
            # "with" closes the file even if reading fails midway
            with p.open(encoding="utf-8", errors="ignore") as fh:
                for i, line in enumerate(fh, 1):
                    if "password123" in line:
                        print(f"{p}:{i}: {line.rstrip()}")
        except (PermissionError, OSError):
            pass
```
Side-by-Side
| | Bash (grep) | Python |
|---|---|---|
| Speed | Very fast (C implementation) | Slower (Python I/O loop) |
| Binary safety | `grep -I` skips binary files | Need `errors="ignore"` or binary check |
| Regex | Built-in (`grep -E`, `grep -P`) | `import re` |
| Output | Filename, line number, matching line | Whatever you want |
| Next step | Pipe to `sed` for replacement | Use `str.replace()` or `re.sub()` |
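The "binary check" and `import re` rows can be combined into one small search routine. The sketch below uses a common heuristic (a NUL byte in the first few kilobytes) to skip binary files. It is an approximation, not a guarantee. The pattern `password\d+` and the helpers `looks_binary` and `find_matches` are assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple

PATTERN = re.compile(r"password\d+")  # hypothetical pattern; adjust to the string you are hunting


def looks_binary(path: Path, sample_size: int = 8192) -> bool:
    """Heuristic: treat a file as binary if its first bytes contain a NUL."""
    with path.open("rb") as fh:
        return b"\x00" in fh.read(sample_size)


def find_matches(root: Path) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for every match in text files."""
    hits = []
    for p in sorted(root.rglob("*")):
        if not p.is_file() or looks_binary(p):
            continue
        with p.open(encoding="utf-8", errors="ignore") as fh:
            for lineno, line in enumerate(fh, 1):
                if PATTERN.search(line):
                    hits.append((str(p.relative_to(root)), lineno, line.rstrip()))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "settings.py").write_text("DEBUG = True\nDB_PASS = 'password123'\n")
    (root / "blob.bin").write_bytes(b"\x00\x01password123\x00")
    (root / "clean.yaml").write_text("name: app\n")
    hits = find_matches(root)
    redacted = PATTERN.sub("<REDACTED>", "DB_PASS = 'password123'")

print(hits)
print(redacted)
```

The same compiled pattern serves both the search and the follow-up replacement with `re.sub`.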
Part 6: The File Operations Rosetta Stone
| Task | Bash | Python |
|---|---|---|
| List files | `ls`, `find . -maxdepth 1` | `Path(".").iterdir()` |
| Recursive list | `find . -type f` | `Path(".").rglob("*")` (also yields directories; filter with `is_file()`) |
| Check if exists | `[ -f file ]`, `[ -d dir ]` | `p.exists()`, `p.is_file()`, `p.is_dir()` |
| File size | `stat -c%s file` or `wc -c < file` | `p.stat().st_size` |
| File age | `find -mtime`, `stat -c%Y` | `p.stat().st_mtime` |
| Read file | `cat file` or `< file` | `p.read_text()` |
| Write file | `echo "x" > file` | `p.write_text("x")` |
| Copy | `cp src dst` | `shutil.copy2(src, dst)` |
| Move/rename | `mv src dst` | `p.rename(dst)` or `shutil.move()` |
| Delete file | `rm file` | `p.unlink()` |
| Delete directory | `rm -rf dir` | `shutil.rmtree(dir)` |
| Create directory | `mkdir -p dir` | `p.mkdir(parents=True, exist_ok=True)` |
| Temp file | `mktemp` | `tempfile.NamedTemporaryFile()` |
| Temp directory | `mktemp -d` | `tempfile.TemporaryDirectory()` |
| Permissions | `chmod 644 file` | `p.chmod(0o644)` |
| Owner | `chown user:group file` | `os.chown(path, uid, gid)` |
| Symlink | `ln -s target link` | `p.symlink_to(target)` |
| Resolve symlink | `readlink -f link` | `p.resolve()` |
| Basename | `basename /a/b/c.txt` → `c.txt` | `p.name` → `c.txt` |
| Extension | `${f##*.}` → `txt` | `p.suffix` → `.txt` |
| Parent dir | `dirname /a/b/c.txt` → `/a/b` | `p.parent` → `/a/b` |
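Several rows of the table can be exercised together in a scratch directory. The walk-through below is a sketch with the rough bash equivalent noted beside each call. The file and directory names are arbitrary, and everything happens inside a temporary directory that is removed at the end.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:           # mktemp -d
    root = Path(tmp)

    nested = root / "a" / "b"
    nested.mkdir(parents=True, exist_ok=True)        # mkdir -p a/b

    src = nested / "c.txt"
    src.write_text("x")                              # echo "x" > c.txt (without the newline)

    copy = root / "copy.txt"
    shutil.copy2(src, copy)                          # cp src dst (copy2 also keeps timestamps)

    moved = root / "moved.txt"
    copy.rename(moved)                               # mv copy.txt moved.txt
    moved.chmod(0o644)                               # chmod 644 moved.txt

    parts = (src.name, src.suffix, src.parent.name)  # basename, extension, parent dir name
    size = moved.stat().st_size                      # stat -c%s moved.txt
    listing = sorted(p.name for p in root.iterdir()) # ls

    moved.unlink()                                   # rm moved.txt
    shutil.rmtree(root / "a")                        # rm -rf a
    empty_after = not any(root.iterdir())

print(parts, size, listing, empty_after)
```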
Flashcard Check
Cover the answers and test yourself.
Q1: `find . -name '*.log' -mtime +30`: what does `+30` mean?
Modified more than 30 days ago. `-30` means less than 30 days. `30` (no sign) means exactly 30 days ago.
Q2: What's the difference between `find ... {} \;` and `find ... {} +`?
`\;` runs one command per file (slow). `+` batches files into fewer commands (fast). It's the same as piping through `xargs`.
Q3: `${f%.BACKUP}`: what does the `%` do?
Removes the shortest match of `.BACKUP` from the end of `$f`. `%%` removes the longest match. `#` and `##` remove from the start.
Q4: Python `Path(".").rglob("*.py")`: what does `rglob` stand for?
Recursive glob. It's equivalent to `glob("**/*.py")`. Plain `glob()` is non-recursive (current directory only).
Q5: Why is `find / -delete -name '*.log'` dangerous?
`find` evaluates predicates left to right. `-delete` comes before `-name`, so it deletes EVERYTHING, then the name filter matches nothing. Always put `-delete` last.
Q6: `stat.S_IWOTH`: what permission bit does this represent?
Others (world) write permission. `S_IW` = write, `OTH` = others. Numeric value: `0o002`.
Q7: `pathlib.Path.with_suffix(".bak")`: does it modify the file?
No. It returns a new `Path` object with the suffix changed. You must call `.rename()` to actually rename the file on disk.
Q8: When should you use `xargs` instead of `find -exec`?
When you need more control over batching (`-n`), parallelism (`-P`), or when piping from a command other than `find`. For `find` alone, `-exec ... {} +` does the same batching as `xargs`.
Exercises
Exercise 1: The Disk Hog (bash)
Write a one-liner that finds the 10 largest files under /var and shows their
size in human-readable format.
Hint 1
`find` can filter by size with `-size`, but for "largest" you need to sort.
Hint 2
Use `find -type f -printf '%s %p\n'` to get size in bytes with the path.
Solution
For human-readable sizes, pipe through `awk` or `numfmt`:
Exercise 2: Port It (bash → python)
Rewrite this bash script as Python using pathlib:
Your Python version should print each file it deletes.
Hint 1
Use `Path.rglob("*.tmp")` and compare `st_mtime` to a cutoff timestamp.
Hint 2
Solution
```python
from pathlib import Path
from datetime import datetime, timedelta

cutoff = (datetime.now() - timedelta(days=7)).timestamp()
count = 0

for p in Path("/opt/app").rglob("*.tmp"):
    if p.is_file() and p.stat().st_mtime < cutoff:
        print(f"deleting: {p}")
        p.unlink()
        count += 1

print(f"Cleaned up {count} old temp files")
```
Exercise 3: Port It (python → bash)
Rewrite this Python script as a bash one-liner or short script:
```python
from pathlib import Path
from collections import Counter

extensions = Counter(
    p.suffix.lower()
    for p in Path(".").rglob("*")
    if p.is_file() and p.suffix
)

for ext, count in extensions.most_common(10):
    print(f"{count:>6} {ext}")
```
(It counts file extensions in the current directory tree.)
Hint 1
Extract the extension with bash parameter expansion or `awk`.
Hint 2
`find . -type f -name '*.*'` gets files with extensions. Then extract the extension.
Solution
Or using `sed`:
Exercise 4: The Permission Audit
Write a script (bash or Python — your choice) that:
1. Scans a directory recursively
2. Finds all files with permissions more permissive than 0644
3. Outputs a report: filename, current permissions, recommended permissions
Hint
A file is "more permissive than 0644" if any of these bits are set: group write, others write, or any execute bit (for non-scripts). Use `stat` to get the mode.
Solution (bash)
Solution (Python)
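One possible Python approach, offered as a sketch and not as the reference answer: treat any permission bit outside `0o644` as "extra" and recommend the mode with those bits cleared. The helper name `audit`, the `BASELINE` constant, and the report layout are assumptions. Special bits (SUID, SGID, sticky) are deliberately ignored here, and the demo assumes a POSIX filesystem.

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import List, Tuple

BASELINE = 0o644  # rw-r--r--


def audit(root: Path, baseline: int = BASELINE) -> List[Tuple[str, str, str]]:
    """Return (path, current, recommended) for files with bits outside the baseline."""
    report = []
    for p in sorted(root.rglob("*")):
        if p.is_symlink() or not p.is_file():
            continue
        mode = stat.S_IMODE(p.stat().st_mode) & 0o777  # ignore SUID/SGID/sticky here
        extra = mode & ~baseline                         # bits the baseline does not allow
        if extra:
            recommended = mode & baseline                # drop only the extra bits
            report.append((str(p.relative_to(root)), f"{mode:04o}", f"{recommended:04o}"))
    return report


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = (("wide.sh", 0o777), ("group.cfg", 0o664), ("fine.txt", 0o644), ("tight.key", 0o600))
    for name, mode in samples:
        (root / name).write_text("x")
        os.chmod(root / name, mode)

    rows = audit(root)
    for path, current, recommended in rows:
        print(f"{path:<12} {current} -> {recommended}")
```

Files that are stricter than the baseline, such as the 0600 key file, are left out of the report because no extra bits are set.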
Exercise 5: The Decision (think, don't code)
For each task, decide: bash or Python? Justify your choice.
- Find all `.env` files in a project tree
- Reorganize 5,000 photos into `YYYY/MM/DD/` directories based on EXIF data
- Check if a config file exists before starting a service
- Compare two directory trees and report differences
- Create a ZIP archive of files modified in the last 24 hours
Answers
1. **Bash.** `find . -name '.env' -type f`: one command, done.
2. **Python.** EXIF parsing requires a library (`Pillow` or `exifread`), date logic, directory creation, and error handling for corrupt files. Bash would need `exiftool` and complex string manipulation.
3. **Bash.** `[ -f /etc/app/config.yaml ] && systemctl start app`: this is what shell scripting was made for.
4. **Either.** Bash: `diff -rq dir1/ dir2/` is built-in and fast. Python: `filecmp.dircmp()` gives programmatic access. Pick based on whether you need the result as data.
5. **Bash.** `find . -mtime -1 -type f -print0 | xargs -0 zip archive.zip`: one pipeline. Python works too (`zipfile` module) but it's more code for no benefit.
Key Takeaways
- `find` is a Swiss Army knife. It filters by name, size, age, type, permissions, and owner, all in one command. Learn its predicates cold.
- `pathlib` is Python's modern file API. Use it instead of `os.path` for new code. It's cleaner, composes better, and handles edge cases (spaces, unicode) natively.
- Quoting in bash is non-negotiable. Always quote `"$variable"`: unquoted variables break on spaces, globbing characters, and empty strings.
- `find -exec {} +` batches like `xargs`. Use `+` instead of `\;` for performance. Only use `\;` when you need exactly one invocation per file.
- Permission audits should be scripted, not manual. Whether you use `find -perm` or Python's `stat`, automate it so it runs on every deploy or as a cron job.
What's Next
- Lesson 3: Process Management: `&`/`wait`/`trap` vs `subprocess`/`asyncio`/`signal`
- Lesson 4: Data Wrangling: `jq`/`cut`/`sort` vs `json`/`csv`/`pandas` one-liners
- Lesson 5: Error Handling: `set -euo pipefail` vs `try`/`except`, retry patterns in both