AI/ML Ops Footguns

Mistakes that waste expensive GPU time, crash training jobs, or bring down your inference serving.


1. Not mounting /dev/shm for PyTorch workloads

You deploy a PyTorch training job on Kubernetes. It works for small datasets, then crashes with RuntimeError: DataLoader worker is killed by signal: Bus error when you scale up. Default /dev/shm in Kubernetes is 64MB. PyTorch DataLoader's multiprocess workers use shared memory, and 64MB is not enough.

Fix: Mount an emptyDir with medium: Memory at /dev/shm. Size it to at least 2x the number of DataLoader workers times the in-memory size of one batch; 8Gi-16Gi is a safe starting point.

Default trap: Kubernetes defaults /dev/shm to 64MB (Docker's default). PyTorch DataLoader with num_workers > 0 uses shared memory for inter-process data transfer. The "Bus error" is SIGBUS: the kernel delivers it when a worker writes to a shared-memory mapping whose backing /dev/shm tmpfs is full. The error message gives no hint that /dev/shm is the cause.
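A minimal pod-spec sketch of the fix (image, names, and the 16Gi size are illustrative; tune sizeLimit to your workers and batch sizes):

```yaml
# Memory-backed emptyDir replacing the 64MB default /dev/shm.
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-train
spec:
  containers:
    - name: trainer
      image: pytorch/pytorch:latest
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory    # tmpfs, counted against the container's memory
        sizeLimit: 16Gi
```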


2. Scheduling CPU pods on GPU nodes

Without taints, Kubernetes treats GPU nodes as regular nodes with extra resources. Your GPU nodes end up running log shippers, monitoring agents, and batch jobs that consume CPU and memory meant for GPU workloads. A training job can't schedule because a dozen non-GPU pods ate 80% of the node's CPU.

Fix: Taint all GPU nodes: kubectl taint nodes <node> nvidia.com/gpu=present:NoSchedule. Only GPU-requesting pods with the matching toleration will be scheduled there. DaemonSets (monitoring, logging) need the toleration added explicitly.

Under the hood: GPU nodes are typically 5-10x more expensive than CPU-only nodes (an AWS p4d.24xlarge with 8 A100 GPUs costs ~$32/hour). A logging DaemonSet pod consuming 2 CPU cores and 4GB RAM on a GPU node costs your org 10x what the same pod costs on a regular node. Taints aren't just about scheduling correctness — they're about not burning $200/day on a Fluentd pod that runs fine on a $20/day node.
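A pod-spec fragment sketching the matching toleration (the key, value, and effect must mirror whatever you passed to kubectl taint; this assumes the nvidia.com/gpu=present:NoSchedule taint above):

```yaml
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
  containers:
    - name: trainer
      resources:
        limits:
          nvidia.com/gpu: 1   # only GPU-requesting pods should tolerate the taint
```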


3. Letting training jobs run without checkpointing

Your training job runs for 72 hours. At hour 68, the node gets preempted (spot instance) or a GPU error kills the pod. 68 hours of compute time — thousands of dollars — gone. No checkpoints were saved.

Fix: Checkpoint every N steps to persistent storage (PVC, S3). Most ML frameworks support this natively. Set the checkpoint interval based on how much work you can afford to lose. For expensive training runs, checkpoint every 30-60 minutes. Store at least the last 3 checkpoints (in case the latest is corrupted).

Remember: The cost-of-lost-work formula: hourly_gpu_cost * hours_since_last_checkpoint. At $32/hour for 8 A100 GPUs, a 4-hour checkpoint gap means $128 wasted per interruption. Spot instances have a ~5% interruption rate per day. Over a 7-day training run, the expected cost of NOT checkpointing hourly far exceeds the I/O overhead of checkpointing.
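A minimal sketch of "checkpoint every N steps, keep the last 3", using pickle and a local directory for illustration; a real job would use its framework's native saving against a PVC or S3. The write-then-rename step means a crash mid-save can never corrupt the newest checkpoint:

```python
import os
import pickle
from pathlib import Path

CKPT_DIR = Path("checkpoints")
KEEP_LAST = 3  # survive one corrupted checkpoint and still have spares

def save_checkpoint(step, state):
    CKPT_DIR.mkdir(exist_ok=True)
    path = CKPT_DIR / f"ckpt-{step:08d}.pkl"
    tmp = path.with_suffix(".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: readers never see a half-written file
    # rotate: delete everything except the newest KEEP_LAST checkpoints
    for old in sorted(CKPT_DIR.glob("ckpt-*.pkl"))[:-KEEP_LAST]:
        old.unlink()
    return path

# Simulated training loop that checkpoints every 100 steps.
for step in range(1, 501):
    if step % 100 == 0:
        save_checkpoint(step, {"step": step, "weights": [0.0]})

remaining = sorted(p.name for p in CKPT_DIR.glob("ckpt-*.pkl"))
print(remaining)  # only the 3 newest (steps 300, 400, 500) survive
```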


4. Using GPU time-slicing for production inference

You enabled time-slicing to run 4 inference models on 1 GPU. Each model thinks it has the full GPU. Under load, all 4 models compete for GPU memory. One model's batch spikes, OOMs the GPU, and all 4 crash simultaneously. Your production inference is down.

Fix: Use MIG (Multi-Instance GPU) on A100/H100 for production inference. MIG provides hardware-level memory isolation — one model can't OOM another. Time-slicing is acceptable for development and testing, but never for production workloads where you need reliability.
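Once MIG partitions exist, a pod requests a specific slice as an extended resource. A fragment sketch, assuming an A100 40GB partitioned under the GPU Operator's "mixed" MIG strategy (your available profile names may differ):

```yaml
resources:
  limits:
    nvidia.com/mig-3g.20gb: 1   # a hardware-isolated 3-compute-slice, 20GB instance
```

Each MIG slice has its own memory partition, so one model's OOM cannot take down the others sharing the physical GPU.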


5. Ignoring XID errors in dmesg

Your GPU nodes log XID errors occasionally. Nobody monitors dmesg. XID 48 (Double Bit ECC Error) has been firing for a week. The GPU is producing incorrect computation results, so your model has been training on silently corrupted math, and the resulting model will be garbage. A week of A100 time wasted.

Fix: Monitor for XID errors and alert immediately. XID 48 and XID 63 indicate hardware failures — the GPU needs to be replaced. XID 13 and XID 31 may be software issues that a driver reset can fix. Set up a Prometheus exporter that scrapes dmesg for NVIDIA XID events, or use DCGM which exposes XID counts as metrics.

Debug clue: nvidia-smi -q -d ECC shows correctable and uncorrectable ECC error counts. Correctable errors are normal in small numbers. A rapid increase in correctable errors is a leading indicator of an impending uncorrectable (XID 48) failure. Track dcgm_fi_dev_ecc_sbe_aggregate_total in Prometheus and alert on the rate of change, not just the absolute count.
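A sketch of the rate-of-change alert as a Prometheus rule. The metric name follows the text above; verify the exact spelling against your DCGM exporter's /metrics endpoint, since it varies by exporter version:

```yaml
groups:
  - name: gpu-ecc
    rules:
      - alert: GpuCorrectableEccRateRising
        # Alert on the rate of new correctable errors, not the absolute count.
        expr: rate(dcgm_fi_dev_ecc_sbe_aggregate_total[1h]) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Correctable ECC errors rising on GPU {{ $labels.gpu }}; possible precursor to an uncorrectable (XID 48) failure"
```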


6. Downloading the model from HuggingFace on every pod restart

Your vLLM deployment has 3 replicas. Each pulls a 70B model (140GB) from HuggingFace on startup. A rolling deployment means 3 sequential 140GB downloads. Each restart takes 30 minutes. Your model serving is unavailable for 90 minutes during an update that should take 5 minutes.

Fix: Cache models on a shared PVC (NFS, EFS) or pre-populate a PVC with the model files. Set HF_HOME to the PVC mount path. The first pod downloads, subsequent pods and restarts read from cache. For air-gapped environments, use an internal model registry.
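A deployment fragment sketching the shared cache (the PVC name and mount path are illustrative; the PVC must be backed by a ReadWriteMany storage class such as NFS or EFS for multiple replicas to mount it):

```yaml
containers:
  - name: vllm
    env:
      - name: HF_HOME
        value: /models/hf-cache   # huggingface_hub downloads land here
    volumeMounts:
      - name: model-cache
        mountPath: /models/hf-cache
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: hf-model-cache   # pre-populated or filled by the first pod
```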


7. Setting GPU resource requests without limits (or vice versa)

You set resources.requests.nvidia.com/gpu: 1 but no limit. Unlike CPU and memory, extended resources such as nvidia.com/gpu don't support overcommit: requests and limits must be equal, and the API server rejects a pod that requests a GPU without a matching limit. (Setting only the limit is fine; the request defaults to it.)

Fix: For GPU resources, always set requests equal to limits. nvidia.com/gpu is an extended resource with no support for fractional or overcommitted allocation. One GPU means one GPU — no partial shares (unless using time-slicing or MIG at the device plugin level).
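The shape that always works:

```yaml
resources:
  requests:
    nvidia.com/gpu: 1   # must equal the limit for extended resources
  limits:
    nvidia.com/gpu: 1
```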


8. Running nvidia-smi in a container without device access

You add nvidia-smi to your container for debugging. The container starts but nvidia-smi shows "no devices found." Your Kubernetes deployment doesn't have GPU resource requests, so the NVIDIA device plugin didn't mount the GPU device into the container.

Fix: GPU access in Kubernetes requires nvidia.com/gpu in the resource spec. Without it, the GPU device files (/dev/nvidia*) are not mounted into the container. For debugging containers that need GPU visibility without using GPU compute, deploy them on GPU nodes with host device access (but this bypasses Kubernetes resource accounting — use sparingly).
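A throwaway debug pod sketch that gets real device access the supported way, by requesting a GPU (the name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-debug
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["sleep", "3600"]   # exec in and run nvidia-smi interactively
      resources:
        limits:
          nvidia.com/gpu: 1   # triggers the device plugin to mount /dev/nvidia*
```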


9. Not accounting for model loading time in readiness probes

Your vLLM deployment has readinessProbe.initialDelaySeconds: 10. The model takes 3 minutes to load into GPU memory. Kubernetes marks the pod as unhealthy after 10 seconds and keeps restarting it. The pod never becomes ready because it's killed before the model finishes loading.

Fix: Budget at least 2x the expected model load time (300-600 seconds for a 70B model). Prefer a startup probe over a large initialDelaySeconds; startup probes allow a long initial window without affecting ongoing health checking once the pod is up.
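A probe-spec sketch allowing up to 10 minutes (60 attempts x 10s) for model load. The /health path and port 8000 assume vLLM's OpenAI-compatible server defaults; adjust to your deployment:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60   # up to 600s before the pod is considered failed
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5       # takes over only after the startup probe succeeds
```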


10. Kernel update on GPU nodes without testing the driver

You have automatic security patching enabled on your GPU nodes. A kernel update lands. The NVIDIA kernel module isn't compatible with the new kernel. On the next reboot, nvidia-smi fails. All GPU workloads on that node crash. If multiple nodes update simultaneously, your entire GPU fleet goes down.

Fix: Pin the kernel version on GPU nodes. Test kernel updates manually: update one node, reboot, verify nvidia-smi works, then roll out to the rest. Alternatively, run the NVIDIA GPU Operator, which ships the driver in a container built against the node's kernel instead of relying on a host DKMS install. Add a post-reboot validation step to your maintenance playbook that checks nvidia-smi before uncordoning the node.

Gotcha: The NVIDIA driver is a kernel module compiled against a specific kernel version. A kernel update invalidates the compiled module. DKMS is supposed to rebuild automatically, but DKMS failures are silent unless you check /var/lib/dkms/nvidia/*/build/make.log. Always verify nvidia-smi returns successfully after any kernel update before putting the node back in service.
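The post-reboot check can be sketched as below; in a real playbook this is typically a one-line shell step, and Python is used here only for illustration. The function name is made up, and the uncordon command in the comment is a placeholder:

```python
import subprocess

def gpu_healthy(cmd: str = "nvidia-smi") -> bool:
    """True only if the driver loads and the tool exits 0."""
    try:
        return subprocess.run([cmd], capture_output=True).returncode == 0
    except FileNotFoundError:
        # Tooling missing entirely, e.g. the kernel module failed to build.
        return False

if gpu_healthy():
    print("GPU healthy: safe to uncordon")   # e.g. kubectl uncordon <node>
else:
    print("GPU check failed: keep node cordoned, inspect dmesg and DKMS logs")
```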