Python Debugging — Street-Level Ops¶
Real-world workflows for debugging hung processes, memory leaks, import failures, and production issues in Python services.
Debugging a Hung Python Process¶
When a Python process is alive but not responding — no logs, no errors, just stuck.
# Step 1: find the PID
ps aux | grep python
# or
pgrep -af 'uvicorn|gunicorn|celery'
# Step 2: see what it's doing RIGHT NOW with py-spy
# py-spy doesn't require instrumentation — attaches to a running process
pip install py-spy
# Live top-like view of the call stack
py-spy top --pid 12345
# Sample output:
# %Own %Total OwnTime TotalTime Function (filename:line)
# 90.0% 90.0% 18.00s 18.00s acquire (threading.py:350)
# 5.0% 95.0% 1.00s 19.00s execute (sqlalchemy/engine/default.py:741)
# Record a flame graph (SVG) for 30 seconds
py-spy record -o profile.svg --pid 12345 --duration 30
# Step 3: strace — see what system calls it's making
strace -p 12345 -e trace=network,read,write -f -t 2>&1 | head -100
# Common findings:
# - Stuck on read() from a socket = waiting for network response (DB? API?)
# - Stuck on futex() = waiting on a lock (deadlock?)
# - Stuck on poll() with long timeout = event loop waiting, but nothing arriving
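The futex() case is worth internalizing. A minimal hypothetical repro (two locks taken in opposite order by two threads) that strace would show parked in futex() forever:

```python
import threading
import time

# Two locks acquired in opposite order by two threads — the classic deadlock.
lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with lock_a:
        time.sleep(0.5)       # give worker_2 time to grab lock_b
        with lock_b:          # blocks forever: worker_2 holds lock_b
            pass

def worker_2():
    with lock_b:
        time.sleep(0.5)
        with lock_a:          # blocks forever: worker_1 holds lock_a
            pass
```

Under strace on Linux, both threads sit in futex() indefinitely; neither log line nor exception ever appears.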
# Step 4: check /proc for clues
cat /proc/12345/status | grep -E 'State|Threads|VmRSS'
# State: S (sleeping)
# Threads: 8
# VmRSS: 524288 kB
ls /proc/12345/fd | wc -l # open file descriptors (ls -la overcounts by 3)
cat /proc/12345/net/tcp | wc -l # TCP sockets in the process's network namespace (+1 header line)
# Step 5: get a Python-level stack trace without py-spy
# Send SIGUSR1 if the app registered it with faulthandler (next section)
kill -SIGUSR1 12345
# Or use GDB (last resort; needs the CPython gdb extensions for 'py-bt')
gdb -p 12345 -batch -ex 'py-bt'
Enabling faulthandler for Production¶
# Add to your app entry point — dumps traceback on SIGSEGV, SIGABRT, SIGUSR1
import faulthandler
import signal
faulthandler.enable() # dump on crash
faulthandler.register(signal.SIGUSR1) # dump on SIGUSR1 (kill -USR1 PID)
# Now you can trigger a traceback dump from the host:
# kill -USR1 <pid>
# Output goes to stderr (which goes to docker logs)
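faulthandler can also act as a hang watchdog: dump_traceback_later dumps every thread's stack if the process is still running after a deadline. A sketch (the wrapper function names here are our own):

```python
import faulthandler

def arm_hang_watchdog(timeout: float = 60.0, file=None) -> None:
    """Dump all thread stacks if we're still running after `timeout` seconds.

    With repeat=True the dump recurs every `timeout` seconds until disarmed.
    `file` defaults to stderr; pass an open file to redirect the dumps.
    """
    kwargs = {"file": file} if file is not None else {}
    faulthandler.dump_traceback_later(timeout, repeat=True, **kwargs)

def disarm_hang_watchdog() -> None:
    faulthandler.cancel_dump_traceback_later()
```

Arm it around a request or batch job that should never take more than N seconds; if it does, the stacks land in your logs.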
Finding Memory Leaks in Long-Running Services¶
A Python service starts at 200MB RSS and grows to 2GB over a week. No OOMKill yet, but it's coming.
# Option 1: tracemalloc — built into Python, zero dependencies
# Add to your app startup:
import tracemalloc
tracemalloc.start(25) # store 25 frames of traceback
# Add a debug endpoint to dump top allocators:
from fastapi import FastAPI

app = FastAPI()

@app.get("/debug/memory")
def memory_snapshot():
    snapshot = tracemalloc.take_snapshot()
    top = snapshot.statistics("lineno")
    return [
        {"file": str(stat.traceback), "size_kb": stat.size / 1024, "count": stat.count}
        for stat in top[:20]
    ]
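Absolute sizes lie; growth is the signal. Comparing two snapshots with compare_to shows exactly which lines allocated more between them (the leaky list below stands in for real traffic):

```python
import tracemalloc

tracemalloc.start(25)
baseline = tracemalloc.take_snapshot()

# ... let the service handle traffic for a while ...
leaky = [object() for _ in range(10_000)]   # stand-in for a real leak

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    # prints lines like "app.py:9: size=... (+...), count=... (+...)"
    print(stat)
```

Take the baseline after warm-up, not at import time, or startup allocations drown out the real leak.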
# Option 2: objgraph — find what's accumulating
pip install objgraph
# In a Python shell or debug endpoint:
import objgraph
objgraph.show_growth(limit=10)
# Output:
# dict 12345 +2345
# list 8901 +1023
# MyModel 5678 +5678 <-- THIS is your leak
# Show what's holding references to leaked objects
objgraph.show_backrefs(
    objgraph.by_type('MyModel')[:3],
    filename='refs.png',
    max_depth=5
)
# Option 3: quick RSS monitoring from outside the process
while true; do
  RSS=$(ps -o rss= -p 12345)
  echo "$(date +%H:%M:%S) RSS: $((RSS / 1024)) MB"
  sleep 60
done
Common Leak Patterns¶
# Leak 1: unbounded cache / global dict that never gets pruned
_cache = {}

def get_user(user_id):
    if user_id not in _cache:
        _cache[user_id] = db.fetch(user_id)  # never evicted
    return _cache[user_id]
# Fix: use functools.lru_cache with maxsize, or cachetools.TTLCache
from cachetools import TTLCache
_cache = TTLCache(maxsize=10000, ttl=300)
# Leak 2: event handlers that accumulate
class EventBus:
    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)  # never unsubscribed
# Leak 3: logging handlers added repeatedly
for request in requests:
    logger.addHandler(StreamHandler())  # adds a new handler every request!
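Hedged fix sketches for leaks 2 and 3 (leak 1's fix is shown above): weak references let subscribers be garbage collected, and an idempotency guard stops handler pile-up. One caveat: a bound method stored in a WeakSet dies immediately; use weakref.WeakMethod for those.

```python
import logging
import weakref

# Leak 2 fix: hold subscribers weakly, so dropping the last strong
# reference to a handler removes it from the bus automatically.
class EventBus:
    def __init__(self):
        self.handlers = weakref.WeakSet()

    def subscribe(self, handler):
        self.handlers.add(handler)

    def publish(self, event):
        for handler in list(self.handlers):
            handler(event)

# Leak 3 fix: configure logging once at startup, never per request.
logger = logging.getLogger("app")
if not logger.handlers:                 # idempotency guard
    logger.addHandler(logging.StreamHandler())
```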
Debugging Import Errors¶
# "ModuleNotFoundError: No module named 'mypackage'" — but you INSTALLED it
# Step 1: check which Python is running
which python3
python3 --version
python3 -c "import sys; print(sys.executable)"
# Step 2: check sys.path — where Python looks for modules
python3 -c "import sys; print('\n'.join(sys.path))"
# Step 3: check if the package is installed in the RIGHT environment
python3 -m pip show mypackage
# If "Location:" doesn't match sys.path, that's your problem
# Step 4: verbose import tracing
python3 -v -c "import mypackage" 2>&1 | tail -30
# Shows every file Python tries to import from, in order
# Step 5: check for broken virtualenv
ls -la $VIRTUAL_ENV/lib/python*/site-packages/ | head -20
# Common fix: the virtualenv was created with a different Python version
# than the one currently active
python3 -c "import sys; print(sys.prefix)" # should be your venv
Docker-Specific Import Debugging¶
# Common mistake: installing in builder stage but running in runtime stage
# without copying site-packages correctly
# WRONG:
FROM python:3.11 AS builder
RUN pip install fastapi uvicorn
FROM python:3.11-slim
COPY --from=builder /app /app
CMD ["python", "-m", "uvicorn", "app:main"]
# ModuleNotFoundError: fastapi not found (site-packages wasn't copied)
# CORRECT:
FROM python:3.11 AS builder
RUN pip install --target=/install fastapi uvicorn
FROM python:3.11-slim
COPY --from=builder /install /usr/local/lib/python3.11/site-packages/
COPY . /app
CMD ["python", "-m", "uvicorn", "app.main:app"]
Debugging Encoding Issues¶
# Symptom: UnicodeDecodeError, garbled output, or mojibake in logs
# Step 1: check what encoding Python thinks it's using
import sys
print(sys.getdefaultencoding()) # usually 'utf-8'
print(sys.getfilesystemencoding()) # usually 'utf-8'
print(sys.stdout.encoding) # might be 'ascii' in Docker!
# Step 2: the Docker gotcha — LANG not set in minimal images
# In your Dockerfile:
# ENV LANG=C.UTF-8
# ENV LC_ALL=C.UTF-8
# ENV PYTHONIOENCODING=utf-8
# Step 3: read files with explicit encoding
with open("data.csv", encoding="utf-8") as f:
data = f.read()
# Step 4: detect encoding of unknown files
# pip install chardet
import chardet

with open("mystery.csv", "rb") as f:
    raw = f.read(10000)
result = chardet.detect(raw)
print(result)  # {'encoding': 'ISO-8859-1', 'confidence': 0.73}
# Step 5: fix mojibake (data was decoded wrong, then re-encoded)
broken = "Ã©"  # this was supposed to be "é"
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # é
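The round trip only works when the damage really was UTF-8 bytes mis-decoded as latin-1; wrapping it in a guard makes it safe to apply blindly (helper name is our own):

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8-bytes-decoded-as-latin-1 corruption, if present."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text          # not that flavor of mojibake: leave it alone
    return repaired
```

For anything messier (double-encoded, mixed encodings), the third-party ftfy library handles far more cases than this one trick.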
Debugging Hanging Async Code¶
# Symptom: FastAPI/asyncio service stops responding, no errors in logs
# Step 1: find blocking calls in async code
# This is the #1 cause of hung async services — a sync call in an async function
# blocks the entire event loop
# BAD: this blocks the entire event loop for every request
@app.get("/users/{user_id}")
async def get_user(user_id: int):
    user = db.query(User).filter(User.id == user_id).first()  # SYNC call!
    return user
# GOOD: run sync code in a thread pool
import asyncio

@app.get("/users/{user_id}")
async def get_user(user_id: int):
    loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside coroutines
    user = await loop.run_in_executor(
        None, lambda: db.query(User).filter(User.id == user_id).first()
    )
    return user
# Step 2: detect event loop blocking with debug mode
# (equivalently, run with the PYTHONASYNCIODEBUG=1 environment variable)
import asyncio

loop = asyncio.get_event_loop()
loop.set_debug(True)
loop.slow_callback_duration = 0.1  # warn on callbacks slower than 100ms
# WARNING: Executing <Task ...> took 2.5 seconds
# Step 3: dump all running tasks (call this from inside the running loop,
# e.g. from a debug endpoint or signal handler)
import asyncio

for task in asyncio.all_tasks():
    print(f"Task: {task.get_name()}")
    task.print_stack()
# Step 4: use py-spy to see where the event loop is stuck
# py-spy top --pid <pid>
# If you see a single thread stuck in a C extension or I/O call,
# that's your blocking call
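A cheap in-process tripwire for the blocked-loop case: a task that sleeps for a short interval and measures how late it wakes up. If the loop was blocked, the wakeup is late by roughly the blocking time. A sketch; names and thresholds are our own:

```python
import asyncio
import logging
import time

logger = logging.getLogger("loop_lag")

async def loop_lag_monitor(interval: float = 1.0, threshold: float = 0.25,
                           on_lag=None) -> None:
    """Warn (and optionally report via on_lag) whenever the event loop
    wakes up more than `threshold` seconds late."""
    while True:
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lag = time.perf_counter() - start - interval
        if lag > threshold:
            logger.warning("event loop blocked for ~%.2fs", lag)
            if on_lag is not None:
                on_lag(lag)
```

Start it once at app startup with asyncio.create_task(loop_lag_monitor()); any sync call that stalls the loop now leaves a timestamped warning in your logs.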
Profiling Slow FastAPI Endpoints¶
# Option 1: py-spy — low-overhead sampling profiler for a running service
py-spy record -o profile.svg --pid $(pgrep -f uvicorn) --duration 60
# Option 2: line-level profiling with line_profiler
pip install line_profiler
# Decorate the slow function:
# from line_profiler import profile
# @profile
# def slow_endpoint_logic():
# ...
# Run with:
# kernprof -l -v app.py
# Option 3: middleware-based timing for all endpoints
import time
import logging
from starlette.middleware.base import BaseHTTPMiddleware

logger = logging.getLogger(__name__)

class TimingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        elapsed = time.perf_counter() - start
        response.headers["X-Process-Time"] = f"{elapsed:.4f}"
        if elapsed > 1.0:
            logger.warning(
                f"SLOW: {request.method} {request.url.path} took {elapsed:.2f}s"
            )
        return response

app.add_middleware(TimingMiddleware)
# Option 4: cProfile for a specific code path
import cProfile
import pstats
import io

def profile_endpoint():
    pr = cProfile.Profile()
    pr.enable()
    # ... your slow code here ...
    result = expensive_operation()
    pr.disable()
    s = io.StringIO()
    ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
    ps.print_stats(20)
    print(s.getvalue())
    return result
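Wrapping that boilerplate in a decorator makes it reusable on any suspect function (a sketch; the decorator name is our own):

```python
import cProfile
import functools
import io
import pstats

def profiled(sort: str = "cumulative", limit: int = 20):
    """Decorator: profile each call and print the top `limit` entries."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            pr = cProfile.Profile()
            pr.enable()
            try:
                return fn(*args, **kwargs)
            finally:
                pr.disable()
                s = io.StringIO()
                pstats.Stats(pr, stream=s).sort_stats(sort).print_stats(limit)
                print(s.getvalue())
        return wrapper
    return decorator
```

Slap @profiled() on the endpoint helper you suspect, read the printed table out of the logs, remove it when done.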
Debugging Segfaults in C Extensions¶
# Symptom: Python process just dies. No traceback. Exit code 139 (SIGSEGV).
# Step 1: enable faulthandler (shows Python traceback on crash)
PYTHONFAULTHANDLER=1 python3 app.py
# or in code:
# import faulthandler; faulthandler.enable()
# Step 2: if faulthandler isn't enough, use gdb
gdb -ex run -ex bt --args python3 app.py
# When it crashes, gdb stops and you can see the C-level backtrace
# Step 3: common culprits
# - numpy/scipy version mismatch with system BLAS/LAPACK
# - pillow compiled against missing shared libs
# - lxml with libxml2 version mismatch
# - any package with C extensions installed via pip on a different arch
# Step 4: verify shared library dependencies
python3 -c "import numpy; print(numpy.__file__)"
ldd /path/to/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so
# Look for "not found" entries
# Step 5: Docker-specific — Alpine + pip = pain
# Alpine uses musl libc, not glibc. Many wheels are compiled for glibc.
# pip builds from source on Alpine, which often fails or produces broken binaries.
# Fix: use python:3.11-slim (Debian), not python:3.11-alpine
Log-Based Debugging for Production¶
When you can't attach a debugger and can't reproduce locally.
import logging
import json
import traceback
# Structured logging — machine-parseable, searchable
# Note: the bare %(message)s only yields valid JSON because every call
# below passes json.dumps(...) as the message
logging.basicConfig(
    format='{"time":"%(asctime)s","level":"%(levelname)s","logger":"%(name)s","msg":%(message)s}',
    level=logging.INFO,
)
logger = logging.getLogger(__name__)
# Log with context — every field is searchable in your log aggregator
def process_order(order_id: str, user_id: str):
    ctx = {"order_id": order_id, "user_id": user_id}
    logger.info(json.dumps({**ctx, "event": "order_started"}))
    try:
        result = charge_payment(order_id)
        logger.info(json.dumps({**ctx, "event": "payment_charged", "amount": result.amount}))
    except PaymentError as e:
        logger.error(json.dumps({
            **ctx,
            "event": "payment_failed",
            "error": str(e),
            "traceback": traceback.format_exc(),
        }))
        raise
# Search structured logs with jq
docker logs myapp 2>&1 | jq 'select(.event == "payment_failed")'
docker logs myapp 2>&1 | jq 'select(.order_id == "ord-12345")'
# Tail logs for a specific error pattern
docker logs -f myapp 2>&1 | jq 'select(.level == "ERROR")'
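The format-string approach above only yields valid JSON because every message is pre-serialized with json.dumps; one plain-string log call from a library breaks every downstream jq query. A Formatter subclass (a sketch, not a library API) removes that footgun:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Serialize the whole record, so plain-string messages stay valid JSON."""
    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            payload["traceback"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```

With this in place, logger.info("plain text") and the json.dumps-based calls above both come out as one valid JSON object per line.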
Debugging Python in Docker¶
# Exec into the container and run a Python shell
docker exec -it myapp python3
# >>> import app.api.main as m
# >>> m.get_health() # call functions directly
# Install debugging tools inside a running container
# (py-spy needs ptrace: start the container with --cap-add SYS_PTRACE)
docker exec -it myapp bash
pip install py-spy ipython objgraph
py-spy top --pid 1 # PID 1 is usually the main process in Docker
# Run a one-off debug container with the same image
docker run --rm -it --entrypoint bash myapp:latest
# python3 -c "from app.api.config import settings; print(settings.dict())"
# Attach to a running container's process with strace
docker exec -it --privileged myapp strace -p 1 -e trace=network -f
# Use breakpoint() in dev (not prod!) with Docker Compose
# docker-compose.yml:
# services:
# app:
# stdin_open: true # needed for breakpoint()
# tty: true # needed for breakpoint()
# environment:
# - PYTHONBREAKPOINT=ipdb.set_trace
# Then attach to the container when it hits the breakpoint:
# docker attach myapp
# Debug environment issues
docker exec myapp env | sort
docker exec myapp python3 -c "import sys; print(sys.path)"
docker exec myapp python3 -m pip list
Quick Reference: Which Tool When¶
| Symptom | Tool | Command |
|---|---|---|
| Process hung, no output | py-spy | py-spy top --pid PID |
| Process hung, need syscalls | strace | strace -p PID -e trace=network,read,write |
| Memory growing over time | tracemalloc | Add to code, hit /debug/memory endpoint |
| What objects are accumulating | objgraph | objgraph.show_growth() |
| Import not working | python -v | python3 -v -c "import mypackage" |
| Segfault, no traceback | faulthandler | PYTHONFAULTHANDLER=1 python3 app.py |
| Slow endpoint | py-spy record | py-spy record -o profile.svg --pid PID |
| Async event loop blocked | asyncio debug | loop.set_debug(True) |
| Production errors, no repro | structured logging | JSON logs + jq queries |
| Docker-specific issue | docker exec | docker exec -it myapp python3 |