The Ops of AI/ML Workloads - Street-Level Ops¶
What experienced GPU cluster operators know that the NVIDIA documentation buries in footnotes.
Quick Diagnosis Commands¶
# GPU health check (run on every GPU node)
nvidia-smi # Overall status
nvidia-smi -q -d TEMPERATURE,POWER # Thermal and power details
nvidia-smi -q -d ECC # ECC memory errors (data corruption)
nvidia-smi -q -d CLOCK # Clock speeds (throttled?)
nvidia-smi topo -m # GPU topology (NVLink connections)
# Check CUDA driver vs toolkit compatibility
nvidia-smi | head -3 # Driver version
nvcc --version 2>/dev/null # Toolkit version (if installed)
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
# GPU memory usage per process
nvidia-smi pmon -c 1 # Per-process GPU utilization
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# Kubernetes GPU status
kubectl describe nodes | grep -A5 nvidia.com/gpu
kubectl get pods -A -o wide | grep -i gpu
# Check for GPU errors in kernel log
dmesg | grep -i -E "nvrm|nvidia|xid"
# XID errors are GPU hardware/software errors:
# XID 13: Graphics Engine Exception (GPU hung)
# XID 31: GPU memory page fault
# XID 48: Double Bit ECC Error (hardware failure)
# XID 79: GPU fallen off the bus (PCIe issue)
# DCGM diagnostics (if installed)
dcgmi diag -r 1 # Quick diagnostic
dcgmi diag -r 3 # Comprehensive diagnostic (takes minutes)
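The XID table above can be turned into a small triage helper. This is a sketch (the XID-to-meaning map covers only the codes discussed here); it parses the `Xid (…): NN` pattern that the NVIDIA kernel module prints into dmesg:

```shell
# Sketch: map common XID codes found in kernel-log lines to a short
# meaning. Covers only the codes discussed in this runbook.
xid_meaning() {
  case "$1" in
    13) echo "Graphics Engine Exception (GPU hung)" ;;
    31) echo "GPU memory page fault" ;;
    48) echo "Double Bit ECC Error (hardware failure)" ;;
    79) echo "GPU fallen off the bus (PCIe issue)" ;;
    *)  echo "Unknown XID $1 (check NVIDIA XID docs)" ;;
  esac
}

# Pull XID numbers out of dmesg-style lines and report each one.
scan_xids() {
  grep -o 'Xid ([^)]*): [0-9]*' | grep -o '[0-9]*$' | while read -r code; do
    echo "XID $code: $(xid_meaning "$code")"
  done
}
```

Usage on a real node: `dmesg | scan_xids`.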
Gotcha: NVIDIA Driver Breaks After Kernel Update¶
You ran apt upgrade on a GPU node. The kernel updated. On reboot, nvidia-smi returns "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver." All GPU pods are in CrashLoopBackOff.
Fix: The kernel module needs to be rebuilt for the new kernel:
# If using DKMS (recommended):
dkms status # Check if module built for new kernel
dkms autoinstall # Rebuild for current kernel
modprobe nvidia # Load the rebuilt module
nvidia-smi # Verify
# If DKMS didn't work, reinstall the driver:
apt install --reinstall nvidia-driver-535
reboot
# Prevention: hold the kernel meta-packages on GPU nodes.
# Holding only linux-image-$(uname -r) does NOT block upgrades:
# each new kernel arrives as a new package name, so hold the
# meta-packages that pull them in.
apt-mark hold linux-image-generic linux-headers-generic linux-generic
# Or: use the NVIDIA GPU Operator in Kubernetes
# which manages driver lifecycle automatically
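A quick way to catch the mismatch before pods crash is to check whether DKMS has a module built for the running kernel. A sketch, written as a function that takes `dkms status` output on stdin and the kernel version as an argument so it can be exercised anywhere:

```shell
# Sketch: does `dkms status` show an nvidia module built ("installed")
# for the given kernel version? Reads dkms output on stdin.
nvidia_module_built_for() {
  if grep -q "nvidia.*$1.*installed"; then
    echo "ok: nvidia module built for kernel $1"
  else
    echo "MISSING: run 'dkms autoinstall' and 'modprobe nvidia'"
  fi
}

# On a real node:
#   dkms status | nvidia_module_built_for "$(uname -r)"
```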
For Kubernetes, the NVIDIA GPU Operator handles driver installation and kernel compatibility automatically:
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.version="535.129.03"
Gotcha: Pod OOMKilled But System RAM Is Fine¶
A GPU training pod is OOMKilled. You check node memory — plenty of free RAM. The OOM was on GPU memory (VRAM), not system memory. Kubernetes doesn't distinguish between the two in its termination message.
Fix: Check the actual failure:
# Check pod events
kubectl describe pod <pod-name> | grep -A10 "Events"
# Check container logs for CUDA OOM
kubectl logs <pod-name> --previous | grep -i "out of memory\|CUDA\|OOM"
# Typical error: "RuntimeError: CUDA out of memory.
# Tried to allocate 2.00 GiB (GPU 0; 79.35 GiB total capacity;
# 76.42 GiB already allocated)"
# The fix depends on the cause:
# 1. Batch size too large:
# Reduce batch_size in the training config
# Rule of thumb: start with batch_size=1, double until you hit OOM,
# then back off one step
# 2. Model too large for one GPU:
# Use tensor parallelism (split across GPUs)
# Or quantize the model (FP16 → INT8 → INT4)
# 3. Memory fragmentation:
# Set env var: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
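Before reaching for parallelism or quantization, it helps to estimate whether the model fits at all. A back-of-envelope sketch (rule-of-thumb numbers; ignores activations, KV cache, and allocator overhead, and glosses over GB vs GiB):

```shell
# Params in billions, result in GB (1B params * 1 byte ~= 1 GB).
weights_gb() { echo $(( $1 * $2 )); }   # params_B * bytes_per_param

# Mixed-precision Adam training: ~16 bytes/param
# (fp16 weights 2 + fp16 grads 2 + fp32 master weights 4 + Adam m,v 8)
train_gb() { echo $(( $1 * 16 )); }

weights_gb 70 2   # 70B model in FP16 -> 140 GB (won't fit one 80 GB GPU)
weights_gb 70 1   # same model in INT8 -> 70 GB (fits, barely)
train_gb 7        # 7B model, Adam training -> ~112 GB before activations
```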
Gotcha: GPU Shows 0% Utilization Despite Running Job¶
The training job is running. nvidia-smi shows 0% GPU utilization. The data scientist says "it's training." The GPU is idle because the data loading pipeline is the bottleneck — the CPU can't feed data to the GPU fast enough.
Fix: The job is CPU-bound, not GPU-bound:
# Check if CPU is the bottleneck
top # CPU at 100%?
iostat -x 1 # Disk I/O saturated?
# Solutions:
# 1. Increase DataLoader workers
# num_workers=8 or num_workers=<num_cpus>
# 2. Ensure /dev/shm is large enough
# PyTorch DataLoader uses shared memory for multiprocessing
kubectl patch deployment <name> --patch '{
  "spec": {"template": {"spec": {
    "containers": [{
      "name": "trainer",
      "volumeMounts": [{"name": "shm", "mountPath": "/dev/shm"}]
    }],
    "volumes": [{"name": "shm", "emptyDir": {"medium": "Memory", "sizeLimit": "16Gi"}}]
  }}}
}'
# 3. Pre-process data to a fast format (WebDataset, TFRecord)
# 4. Use NVMe local storage instead of NFS for training data
# 5. Ensure adequate CPU allocation (at least 4 CPUs per GPU)
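To make the diagnosis quantitative rather than eyeballing `nvidia-smi`, sample utilization over time and average it. A sketch that reads one integer per line (the format printed by `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1`); the 20% threshold is an arbitrary illustrative cutoff:

```shell
# Average a stream of GPU-utilization samples (one integer per line).
avg_util() {
  awk '{ sum += $1; n++ } END { if (n) printf "%d\n", sum / n }'
}

# Flag likely input-pipeline starvation when the average is low.
check_starvation() {
  avg=$(avg_util)
  if [ "$avg" -lt 20 ]; then
    echo "avg ${avg}%: GPU likely starved - check DataLoader workers and storage"
  else
    echo "avg ${avg}%: GPU is being fed"
  fi
}
```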
Gotcha: Model Download Takes Forever on Pod Start¶
Every time a vLLM pod restarts, it downloads the model from HuggingFace (140GB for a 70B model). The pod takes 30 minutes to become ready. During rollout, you have reduced capacity for half an hour.
Fix: Cache models on a shared PVC:
# PVC for model cache (shared across replicas)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: nfs-client
# Use in deployment
# Mount at the HuggingFace cache directory
# Container env: HF_HOME=/models/huggingface
# Container volumeMount: /models → hf-model-cache PVC
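The comment sketch above, written out as a pod-spec fragment (the container name and mount path are illustrative; only the PVC name comes from the manifest above):

```yaml
# Sketch: wire the cache PVC into a serving container
spec:
  containers:
    - name: vllm
      env:
        - name: HF_HOME
          value: /models/huggingface
      volumeMounts:
        - name: model-cache
          mountPath: /models
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: hf-model-cache
```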
Alternatively, pre-pull models with an init container or a CronJob:
# Pre-download model to shared storage
huggingface-cli download meta-llama/Llama-2-70b-chat-hf \
  --local-dir /models/llama-2-70b \
  --local-dir-use-symlinks False
Gotcha: MIG Configuration Doesn't Survive Reboot¶
You configured MIG on your A100s. The node reboots. MIG mode is disabled and your partitioned GPU resources are gone. Pods requesting MIG slices can't be scheduled.
Fix: Persist MIG configuration:
# nvidia-persistenced keeps the driver initialized across job runs,
# but the MIG geometry itself must be re-created on every boot:
systemctl enable nvidia-persistenced
# Create a startup script for MIG configuration
# /etc/systemd/system/nvidia-mig-config.service
[Unit]
Description=Configure NVIDIA MIG
After=nvidia-persistenced.service
Requires=nvidia-persistenced.service
[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -i 0 -mig 1
# Profiles 9,9,19 = two 3g + one 1g instance (7/7 compute slices on A100)
ExecStart=/usr/bin/nvidia-smi mig -i 0 -cgi 9,9,19 -C
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
# Or use the NVIDIA GPU Operator with MIG Manager
# which handles MIG configuration declaratively
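An A100 has 7 compute slices, and `nvidia-smi mig -cgi` silently becomes a scheduling headache if the requested profiles don't fit. A sketch that sanity-checks a comma-separated profile list against the standard A100 profile-ID table (0=7g, 5=4g, 9=3g, 14=2g, 19=1g):

```shell
# Sum the compute slices of a comma-separated -cgi profile list.
mig_slices() {
  total=0
  for id in $(echo "$1" | tr ',' ' '); do
    case "$id" in
      0) s=7 ;; 5) s=4 ;; 9) s=3 ;; 14) s=2 ;; 19) s=1 ;;
      *) echo "unknown profile $id"; return 1 ;;
    esac
    total=$(( total + s ))
  done
  echo "$total"
}

# An A100 has 7 compute slices; reject plans that overflow.
mig_fits() {
  n=$(mig_slices "$1") || return 1
  if [ "$n" -le 7 ]; then echo "ok: $n/7 slices"; else echo "INVALID: $n/7 slices"; fi
}
```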
Pattern: GPU Node Tainting and Toleration¶
Prevent non-GPU workloads from landing on expensive GPU nodes:
# Taint GPU nodes
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
kubectl taint nodes gpu-node-2 nvidia.com/gpu=present:NoSchedule
# GPU pods must tolerate the taint
# Add to pod spec:
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
nodeSelector:
  nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
Also label GPU nodes by GPU type for selective scheduling:
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=t4
# Training jobs → a100 nodes
# Inference jobs → t4 nodes (cheaper)
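Taint and label commands are easy to fat-finger across a fleet, so it can be worth generating them as a reviewable plan first. A sketch (node names and GPU types here are illustrative) that prints the commands instead of running them:

```shell
# Emit taint + label commands for "node:gputype" pairs (dry run).
gpu_node_setup() {
  for pair in "$@"; do
    node=${pair%%:*}
    type=${pair##*:}
    echo "kubectl taint nodes $node nvidia.com/gpu=present:NoSchedule --overwrite"
    echo "kubectl label nodes $node gpu-type=$type --overwrite"
  done
}

# Review the output, then pipe to sh when satisfied:
#   gpu_node_setup gpu-node-1:a100 gpu-node-2:t4 | sh
```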
Pattern: GPU Cluster Cost Monitoring¶
GPUs are expensive. Track utilization to justify the spend:
# Prometheus alert: GPU sitting idle (wasting money)
# An idle A100 on AWS costs ~$10/hour
- alert: GPUIdleWastingMoney
  expr: DCGM_FI_DEV_GPU_UTIL < 5
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} idle for 1 hour"
    description: "At $10/hr, this idle GPU costs $240/day. Consider scaling down or scheduling a job."
# Track GPU utilization trends in Grafana
# Panel: average GPU utilization across fleet
# Target: >60% average utilization (below = over-provisioned)
# Compare: GPU cost per month vs. GPU utilization %
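The alert's dollar framing generalizes into a rough monthly waste figure. A sketch using the document's ~$10/GPU-hour figure and integer math (730 hours per month; idle fraction as a percentage):

```shell
# Rough monthly cost of idle GPU capacity.
# $1 = gpu count, $2 = $/GPU-hour, $3 = average idle percent
idle_cost_per_month() {
  echo $(( $1 * $2 * 730 * $3 / 100 ))
}

idle_cost_per_month 8 10 40   # 8 GPUs, $10/hr, 40% idle -> $23360/month
```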
Emergency: XID 79 — GPU Fell Off the Bus¶
nvidia-smi shows "GPU has fallen off the bus" or dmesg shows "XID 79." The GPU is no longer accessible to the system. This is usually a hardware issue (PCIe link failure, overheating, power supply issue).
1. Check dmesg for details:
dmesg | grep -i "xid\|nvrm\|nvidia\|pcie"
2. Try a GPU reset (may work for transient issues):
nvidia-smi -r -i <gpu-id> # Reset specific GPU
3. If reset fails, reboot the node:
# Drain the node first
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
reboot
4. If the GPU doesn't come back after reboot:
- Check PCIe slot seating (physical access required)
- Check power connectors to the GPU
- Check GPU temperature logs (was it overheating?)
- RMA the GPU if hardware failure confirmed
5. Kubernetes recovery:
kubectl uncordon <node> # After reboot
kubectl get nodes | grep <node> # Verify Ready
kubectl describe node <node> | grep nvidia.com/gpu # GPU visible?
Emergency: All GPUs Thermal Throttling¶
Training jobs are 40% slower than expected. nvidia-smi shows GPU temperatures at 85°C+ and clock speeds are reduced.
1. Check temperatures:
nvidia-smi -q -d TEMPERATURE
# Target: <80°C under sustained load
# Throttle: starts ~83°C, severe at 90°C
2. Immediate mitigation:
# Reduce GPU power limit (lower heat, lower performance)
nvidia-smi -pl 250 # Limit to 250W (from 400W default on A100)
3. Root cause investigation:
- Ambient temperature in the server room/rack
- Fan failures (check BMC/IPMI)
- Blocked airflow (cabling, blanking panels missing)
- Cooling system failure (CRAC/CRAH units)
- Too many GPUs in adjacent slots without spacing
4. Long-term fixes:
- Improve rack airflow (hot/cold aisle containment)
- Add supplemental cooling
- Space out GPU nodes in the rack
- Use liquid cooling for dense GPU deployments
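Detection is easy to automate before jobs slow down. A sketch that flags GPUs at or above the ~83°C throttle point, reading the output of `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits`:

```shell
# Flag GPUs whose temperature is at/above the ~83C throttle point.
# Input: "index, temperature" CSV lines, one per GPU.
flag_hot_gpus() {
  awk -F', *' '$2 >= 83 { print "GPU " $1 ": " $2 "C - throttling likely" }'
}

# On a real node:
#   nvidia-smi --query-gpu=index,temperature.gpu \
#     --format=csv,noheader,nounits | flag_hot_gpus
```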