Python Debugging — Street-Level Ops

Real-world workflows for debugging hung processes, memory leaks, import failures, and production issues in Python services.

Debugging a Hung Python Process

When a Python process is alive but not responding — no logs, no errors, just stuck.

# Step 1: find the PID
ps aux | grep python
# or
pgrep -af 'uvicorn|gunicorn|celery'

# Step 2: see what it's doing RIGHT NOW with py-spy
# py-spy doesn't require instrumentation — attaches to a running process
pip install py-spy

# Live top-like view of the call stack
py-spy top --pid 12345

# Sample output:
#   %Own   %Total  OwnTime  TotalTime  Function (filename:line)
#   90.0%  90.0%   18.00s   18.00s     acquire (threading.py:350)
#    5.0%  95.0%    1.00s   19.00s     execute (sqlalchemy/engine/default.py:741)

# Record a flame graph (SVG) for 30 seconds
py-spy record -o profile.svg --pid 12345 --duration 30

# Step 3: strace — see what system calls it's making
strace -p 12345 -e trace=network,read,write -f -t 2>&1 | head -100

# Common findings:
# - Stuck on read() from a socket = waiting for network response (DB? API?)
# - Stuck on futex() = waiting on a lock (deadlock?)
# - Stuck on poll() with long timeout = event loop waiting, but nothing arriving

# Step 4: check /proc for clues
cat /proc/12345/status | grep -E 'State|Threads|VmRSS'
# State:  S (sleeping)
# Threads:  8
# VmRSS:  524288 kB

ls -la /proc/12345/fd | wc -l     # open file descriptors
cat /proc/12345/net/tcp | wc -l   # open TCP connections

# Step 5: get a Python-level stack trace without py-spy
# Send SIGUSR1 if the app has faulthandler enabled
kill -SIGUSR1 12345

# Or use GDB (last resort — py-bt needs the python-gdb extensions / debug symbols,
# e.g. the python3-dbg package on Debian)
gdb -p 12345 -batch -ex py-bt

Enabling faulthandler for Production

# Add to your app entry point — dumps traceback on SIGSEGV, SIGABRT, SIGUSR1
import faulthandler
import signal

faulthandler.enable()                         # dump on crash
faulthandler.register(signal.SIGUSR1)         # dump on SIGUSR1 (kill -USR1 PID)

# Now you can trigger a traceback dump from the host:
# kill -USR1 <pid>
# Output goes to stderr (which goes to docker logs)
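faulthandler can also dump to any writable file object, which is handy for smoke-testing the setup locally before relying on it in production. A minimal sketch (the temp file is just for the demo):

```python
import faulthandler
import os
import tempfile

# Dump the current Python traceback to a file — the same output you'd
# see on stderr after kill -USR1 in production.
with tempfile.NamedTemporaryFile(mode="w+", delete=False) as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    path = f.name

with open(path) as fh:
    dump = fh.read()
os.remove(path)

print(dump)  # "Current thread 0x... (most recent call first): File ..."
```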

Finding Memory Leaks in Long-Running Services

A Python service starts at 200MB RSS and grows to 2GB over a week. No OOMKill yet, but it's coming.

# Option 1: tracemalloc — built into Python, zero dependencies
# Add to your app startup:
import tracemalloc
tracemalloc.start(25)  # store 25 frames of traceback

# Add a debug endpoint to dump top allocators:
from fastapi import FastAPI
app = FastAPI()

@app.get("/debug/memory")
def memory_snapshot():
    snapshot = tracemalloc.take_snapshot()
    top = snapshot.statistics("lineno")
    return [
        {"file": str(stat.traceback), "size_kb": stat.size / 1024, "count": stat.count}
        for stat in top[:20]
    ]
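For leak hunting, diffing two snapshots is usually more telling than a single one — the comparison shows what grew between two points in time. A self-contained sketch:

```python
import tracemalloc

tracemalloc.start(25)

snap1 = tracemalloc.take_snapshot()
leaked = [bytearray(1024) for _ in range(1000)]  # simulate ~1 MB of growth
snap2 = tracemalloc.take_snapshot()

# Diff the snapshots: entries are sorted with the biggest change first,
# pointing straight at the allocating line
for stat in snap2.compare_to(snap1, "lineno")[:5]:
    print(stat)

tracemalloc.stop()
```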

# Option 2: objgraph — find what's accumulating
pip install objgraph

# In a Python shell or debug endpoint:
import objgraph
objgraph.show_growth(limit=10)
# Output:
# dict        12345   +2345
# list         8901   +1023
# MyModel      5678   +5678   <-- THIS is your leak

# Show what's holding references to leaked objects
objgraph.show_backrefs(
    objgraph.by_type('MyModel')[:3],
    filename='refs.png',
    max_depth=5
)

# Option 3: quick RSS monitoring from outside the process
while true; do
    RSS=$(ps -o rss= -p 12345)
    echo "$(date +%H:%M:%S) RSS: $((RSS / 1024)) MB"
    sleep 60
done
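The same RSS number can be read from inside the process via /proc (Linux-only, no psutil needed) — useful if you want to emit memory as a metric from the app itself:

```python
def rss_mb() -> float:
    """Current process RSS in MB, parsed from /proc (Linux-only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # value is reported in kB
    return 0.0

print(f"RSS: {rss_mb():.1f} MB")
```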

Common Leak Patterns

# Leak 1: unbounded cache / global dict that never gets pruned
_cache = {}
def get_user(user_id):
    if user_id not in _cache:
        _cache[user_id] = db.fetch(user_id)  # never evicted
    return _cache[user_id]

# Fix: use functools.lru_cache with maxsize, or cachetools.TTLCache
from cachetools import TTLCache
_cache = TTLCache(maxsize=10000, ttl=300)
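If you'd rather stay in the stdlib, functools.lru_cache gives the same bounded behavior (size-based eviction, no TTL). A tiny demo with a stand-in for the DB fetch:

```python
from functools import lru_cache

@lru_cache(maxsize=2)  # at most 2 entries; least-recently-used is evicted
def get_user(user_id):
    return {"id": user_id}  # stand-in for db.fetch(user_id)

get_user(1)
get_user(2)
get_user(3)  # evicts user 1

print(get_user.cache_info())  # currsize stays capped at maxsize
```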

# Leak 2: event handlers that accumulate
class EventBus:
    def __init__(self):
        self.handlers = []
    def subscribe(self, handler):
        self.handlers.append(handler)  # never unsubscribed
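One fix is to hold subscribers weakly, so the bus never keeps them alive. A sketch using weakref.WeakSet (works for callable objects; bound methods would need weakref.WeakMethod instead):

```python
import gc
import weakref

class EventBus:
    def __init__(self):
        self._handlers = weakref.WeakSet()  # weak refs: bus won't keep handlers alive

    def subscribe(self, handler):
        self._handlers.add(handler)

    def emit(self, event):
        for handler in list(self._handlers):
            handler(event)

class Listener:
    def __init__(self):
        self.seen = []
    def __call__(self, event):
        self.seen.append(event)

bus = EventBus()
listener = Listener()
bus.subscribe(listener)
bus.emit("order_created")

del listener               # last strong reference gone...
gc.collect()
print(len(bus._handlers))  # ...and the bus forgot it: 0
```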

# Leak 3: logging handlers added repeatedly
for request in requests:
    logger.addHandler(StreamHandler())  # adds a new handler every request!
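The usual fix is to configure handlers once at startup, or guard the add. A minimal sketch:

```python
import logging

logger = logging.getLogger("app")

def handle_request():
    # Guard: only attach a handler if none exists yet
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())

for _ in range(1000):  # simulate 1000 requests
    handle_request()

print(len(logger.handlers))  # 1, not 1000
```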

Debugging Import Errors

# "ModuleNotFoundError: No module named 'mypackage'" — but you INSTALLED it

# Step 1: check which Python is running
which python3
python3 --version
python3 -c "import sys; print(sys.executable)"

# Step 2: check sys.path — where Python looks for modules
python3 -c "import sys; print('\n'.join(sys.path))"

# Step 3: check if the package is installed in the RIGHT environment
python3 -m pip show mypackage
# If "Location:" doesn't match sys.path, that's your problem

# Step 4: verbose import tracing
python3 -v -c "import mypackage" 2>&1 | tail -30
# Shows every file Python tries to import from, in order

# Step 5: check for broken virtualenv
ls -la $VIRTUAL_ENV/lib/python*/site-packages/ | head -20

# Common fix: the virtualenv was created with a different Python version
# than the one currently active
python3 -c "import sys; print(sys.prefix)"  # should be your venv
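You can also ask the import machinery directly where (or whether) it would load a module, without importing it, via importlib.util.find_spec:

```python
import importlib.util

# Installed module: the spec tells you exactly which file would be loaded
spec = importlib.util.find_spec("json")
print(spec.origin)  # path to json/__init__.py in the active environment

# Missing top-level module: find_spec returns None instead of raising
missing = importlib.util.find_spec("mypackage_that_is_not_installed")
print(missing)  # None
```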

Docker-Specific Import Debugging

# Common mistake: installing in builder stage but running in runtime stage
# without copying site-packages correctly

# WRONG:
FROM python:3.11 AS builder
RUN pip install fastapi uvicorn
FROM python:3.11-slim
COPY --from=builder /app /app
CMD ["python", "-m", "uvicorn", "app:main"]
# ModuleNotFoundError: fastapi not found (site-packages wasn't copied)

# CORRECT:
FROM python:3.11 AS builder
RUN pip install --target=/install fastapi uvicorn
FROM python:3.11-slim
COPY --from=builder /install /usr/local/lib/python3.11/site-packages/
COPY . /app
WORKDIR /app
CMD ["python", "-m", "uvicorn", "app.main:app"]

Debugging Encoding Issues

# Symptom: UnicodeDecodeError, garbled output, or mojibake in logs

# Step 1: check what encoding Python thinks it's using
import sys
print(sys.getdefaultencoding())       # usually 'utf-8'
print(sys.getfilesystemencoding())    # usually 'utf-8'
print(sys.stdout.encoding)           # might be 'ascii' in Docker!

# Step 2: the Docker gotcha — LANG not set in minimal images
# In your Dockerfile:
# ENV LANG=C.UTF-8
# ENV LC_ALL=C.UTF-8
# ENV PYTHONIOENCODING=utf-8

# Step 3: read files with explicit encoding
with open("data.csv", encoding="utf-8") as f:
    data = f.read()

# Step 4: detect encoding of unknown files
# pip install chardet
import chardet
with open("mystery.csv", "rb") as f:
    raw = f.read(10000)
    result = chardet.detect(raw)
    print(result)  # {'encoding': 'ISO-8859-1', 'confidence': 0.73}
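Without third-party deps you can also try a short cascade of likely encodings; latin-1 never raises, so it makes a guaranteed last resort (a sketch — adjust the candidate list to the encodings your data actually uses):

```python
def sniff_decode(raw: bytes):
    """Try likely encodings in order; latin-1 is a can't-fail fallback."""
    for enc in ("utf-8", "latin-1"):
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue

text, enc = sniff_decode("héllo".encode("utf-8"))
print(enc)        # utf-8

text, enc = sniff_decode(b"caf\xe9")  # latin-1 bytes, not valid UTF-8
print(text, enc)  # café latin-1
```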

# Step 5: fix mojibake (data was decoded wrong, then re-encoded)
broken = "Ã©"  # UTF-8 bytes for "é" that were wrongly decoded as Latin-1
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # é

Debugging Hanging Async Code

# Symptom: FastAPI/asyncio service stops responding, no errors in logs

# Step 1: find blocking calls in async code
# This is the #1 cause of hung async services — a sync call in an async function
# blocks the entire event loop

# BAD: this blocks the entire event loop for every request
@app.get("/users/{user_id}")
async def get_user(user_id: int):
    user = db.query(User).filter(User.id == user_id).first()  # SYNC call!
    return user

# GOOD: run sync code in a thread pool
import asyncio

@app.get("/users/{user_id}")
async def get_user(user_id: int):
    loop = asyncio.get_running_loop()
    user = await loop.run_in_executor(
        None, lambda: db.query(User).filter(User.id == user_id).first()
    )
    return user
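On Python 3.9+, asyncio.to_thread is a shorter spelling of the same thread-pool hand-off:

```python
import asyncio
import time

def blocking_lookup(user_id: int):
    time.sleep(0.05)  # stand-in for a sync DB call
    return {"id": user_id}

async def handler():
    # Runs blocking_lookup in the default executor; the event loop stays free
    return await asyncio.to_thread(blocking_lookup, 42)

result = asyncio.run(handler())
print(result)  # {'id': 42}
```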

# Step 2: detect event loop blocking with debug mode
import asyncio
loop = asyncio.get_running_loop()  # call this inside the app, e.g. at startup
loop.set_debug(True)
loop.slow_callback_duration = 0.1  # warn on anything blocking over 100 ms
# WARNING: Executing <Task ...> took 2.5 seconds

# Step 3: dump all running tasks (must run inside the event loop,
# e.g. from a debug endpoint — asyncio.all_tasks() needs a running loop)
import asyncio
for task in asyncio.all_tasks():
    print(f"Task: {task.get_name()}")
    task.print_stack()
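A self-contained demo of the task dump (the task name is made up for illustration):

```python
import asyncio

names = []

async def worker():
    await asyncio.sleep(60)  # pretend to be stuck

async def main():
    task = asyncio.create_task(worker(), name="stuck-worker")
    await asyncio.sleep(0)  # let the worker start
    for t in asyncio.all_tasks():
        names.append(t.get_name())
        t.print_stack()      # shows where each task is suspended
    task.cancel()

asyncio.run(main())
print(names)  # includes 'stuck-worker'
```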

# Step 4: use py-spy to see where the event loop is stuck
# py-spy top --pid <pid>
# If you see a single thread stuck in a C extension or I/O call,
# that's your blocking call

Profiling Slow FastAPI Endpoints

# Option 1: py-spy — low-overhead sampling profiler for a running service
py-spy record -o profile.svg --pid $(pgrep -f uvicorn) --duration 60

# Option 2: line-level profiling with line_profiler
pip install line_profiler

# Decorate the slow function:
# from line_profiler import profile
# @profile
# def slow_endpoint_logic():
#     ...

# Run with:
# kernprof -l -v app.py

# Option 3: middleware-based timing for all endpoints
import time
import logging
from starlette.middleware.base import BaseHTTPMiddleware

logger = logging.getLogger(__name__)

class TimingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        elapsed = time.perf_counter() - start
        response.headers["X-Process-Time"] = f"{elapsed:.4f}"
        if elapsed > 1.0:
            logger.warning(
                f"SLOW: {request.method} {request.url.path} took {elapsed:.2f}s"
            )
        return response

app.add_middleware(TimingMiddleware)

# Option 4: cProfile for a specific code path
import cProfile
import pstats
import io

def profile_endpoint():
    pr = cProfile.Profile()
    pr.enable()

    # ... your slow code here ...
    result = expensive_operation()

    pr.disable()
    s = io.StringIO()
    ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
    ps.print_stats(20)
    print(s.getvalue())
    return result
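On Python 3.8+, cProfile.Profile is also a context manager, which trims the enable/disable boilerplate (expensive_operation below is a stand-in):

```python
import cProfile
import io
import pstats

def expensive_operation():
    return sum(i * i for i in range(200_000))

with cProfile.Profile() as pr:  # enable/disable handled by the with-block
    result = expensive_operation()

s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(10)
out = s.getvalue()
print(out)  # "... function calls in ... seconds" plus the top entries
```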

Debugging Segfaults in C Extensions

# Symptom: Python process just dies. No traceback. Exit code 139 (SIGSEGV).

# Step 1: enable faulthandler (shows Python traceback on crash)
PYTHONFAULTHANDLER=1 python3 app.py
# or in code:
# import faulthandler; faulthandler.enable()

# Step 2: if faulthandler isn't enough, use gdb
gdb -ex run -ex bt --args python3 app.py
# When it crashes, gdb stops and you can see the C-level backtrace

# Step 3: common culprits
# - numpy/scipy version mismatch with system BLAS/LAPACK
# - pillow compiled against missing shared libs
# - lxml with libxml2 version mismatch
# - any package with C extensions installed via pip on a different arch

# Step 4: verify shared library dependencies
python3 -c "import numpy; print(numpy.__file__)"
ldd /path/to/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so
# Look for "not found" entries

# Step 5: Docker-specific — Alpine + pip = pain
# Alpine uses musl libc, not glibc. Many wheels are compiled for glibc.
# pip builds from source on Alpine, which often fails or produces broken binaries.
# Fix: use python:3.11-slim (Debian), not python:3.11-alpine

Log-Based Debugging for Production

When you can't attach a debugger and can't reproduce locally.

import logging
import json
import traceback

# Structured logging — machine-parseable, searchable
# (%(message)s is deliberately unquoted: every log call below emits valid JSON)
logging.basicConfig(
    format='{"time":"%(asctime)s","level":"%(levelname)s","logger":"%(name)s","msg":%(message)s}',
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

# Log with context — every field is searchable in your log aggregator
def process_order(order_id: str, user_id: str):
    ctx = {"order_id": order_id, "user_id": user_id}
    logger.info(json.dumps({**ctx, "event": "order_started"}))
    try:
        result = charge_payment(order_id)
        logger.info(json.dumps({**ctx, "event": "payment_charged", "amount": result.amount}))
    except PaymentError as e:
        logger.error(json.dumps({
            **ctx,
            "event": "payment_failed",
            "error": str(e),
            "traceback": traceback.format_exc(),
        }))
        raise

# Search structured logs with jq
docker logs myapp 2>&1 | jq 'select(.event == "payment_failed")'
docker logs myapp 2>&1 | jq 'select(.order_id == "ord-12345")'

# Tail logs for a specific error pattern
docker logs -f myapp 2>&1 | jq 'select(.level == "ERROR")'
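The basicConfig format-string trick only works if every message is pre-serialized JSON; a logging.Formatter subclass is sturdier and escapes everything for you. A minimal sketch:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object per line."""
    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            entry["traceback"] = self.formatException(record.exc_info)
        return json.dumps(entry)

# Demo: log into a buffer and parse the result back
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order ord-12345 failed: card declined")
entry = json.loads(buf.getvalue())
print(entry["level"], entry["msg"])  # INFO order ord-12345 failed: card declined
```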

Debugging Python in Docker

# Exec into the container and run a Python shell
docker exec -it myapp python3
# >>> import app.api.main as m
# >>> m.get_health()  # call functions directly

# Install debugging tools inside a running container
docker exec -it myapp bash
pip install py-spy ipython objgraph
py-spy top --pid 1  # PID 1 is usually the main process in Docker
# (py-spy needs ptrace; run the container with --cap-add SYS_PTRACE if it fails)

# Run a one-off debug container with the same image
docker run --rm -it --entrypoint bash myapp:latest
# python3 -c "from app.api.config import settings; print(settings.dict())"

# Attach to a running container's process with strace
docker exec -it --privileged myapp strace -p 1 -e trace=network -f

# Use breakpoint() in dev (not prod!) with Docker Compose
# docker-compose.yml:
# services:
#   app:
#     stdin_open: true   # needed for breakpoint()
#     tty: true          # needed for breakpoint()
#     environment:
#       - PYTHONBREAKPOINT=ipdb.set_trace

# Then attach to the container when it hits the breakpoint:
# docker attach myapp

# Debug environment issues
docker exec myapp env | sort
docker exec myapp python3 -c "import sys; print(sys.path)"
docker exec myapp python3 -m pip list

Quick Reference: Which Tool When

Symptom                          Tool                 Command
Process hung, no output          py-spy               py-spy top --pid PID
Process hung, need syscalls      strace               strace -p PID -e trace=network,read,write
Memory growing over time         tracemalloc          Add to code, hit /debug/memory endpoint
What objects are accumulating    objgraph             objgraph.show_growth()
Import not working               python -v            python3 -v -c "import mypackage"
Segfault, no traceback           faulthandler         PYTHONFAULTHANDLER=1 python3 app.py
Slow endpoint                    py-spy record        py-spy record -o profile.svg --pid PID
Async event loop blocked         asyncio debug        loop.set_debug(True)
Production errors, no repro      structured logging   JSON logs + jq queries
Docker-specific issue            docker exec          docker exec -it myapp python3