Portal | Level: L2: Operations | Topics: AI/ML Infrastructure Ops, Kubernetes Core, AI Tools for DevOps | Domain: DevOps & Tooling
The Ops of AI/ML Workloads - Primer¶
Why This Matters¶
You don't need to understand how transformers work to keep a GPU cluster running. But you do need to understand GPU scheduling, CUDA driver management, model serving infrastructure, and why a data scientist's "it works on my laptop" model deployment is about to OOM your most expensive nodes. AI/ML workloads are the fastest-growing category of infrastructure spend, and the ops patterns are different enough from traditional web services that your existing Kubernetes muscle memory will mislead you.
GPU hardware is expensive ($10K-$40K per card), GPU time is scarce, and ML engineers have a fundamentally different relationship with infrastructure than web developers: they want big machines, lots of memory, and long-running jobs, and they want them now. Your job is to make the GPU cluster reliable and efficient without bankrupting the company.
Core Concepts¶
1. GPU Architecture for Ops People¶
You don't need to understand matrix multiplication, but you need to understand the hardware hierarchy:
┌─────────────────────────────────────────────────────┐
│ GPU Node (bare metal or cloud instance) │
│ │
│ CPU: 64 cores, 256GB RAM │
│ ├── PCIe bus │
│ │ ├── GPU 0 (NVIDIA A100 80GB) │
│ │ ├── GPU 1 (NVIDIA A100 80GB) │
│ │ ├── GPU 2 (NVIDIA A100 80GB) │
│ │ └── GPU 3 (NVIDIA A100 80GB) │
│ │ │
│ │ NVLink (GPU-to-GPU high-speed interconnect) │
│ │ GPU 0 ←→ GPU 1 ←→ GPU 2 ←→ GPU 3 │
│ │ 600 GB/s (much faster than PCIe) │
│ │ │
│ Storage: NVMe SSDs + NFS mount for model storage │
│ Network: 25/100 Gbps (data loading, distributed) │
└─────────────────────────────────────────────────────┘
Key metrics you need to monitor:
├── GPU utilization (%) — is the GPU actually computing?
├── GPU memory used (GB) — how close to OOM?
├── GPU temperature (°C) — thermal throttling?
├── GPU power draw (W) — capacity planning
├── PCIe throughput (GB/s) — data feeding the GPU fast enough?
└── NVLink throughput (GB/s) — multi-GPU communication
Common GPU hardware in production:
| GPU | VRAM | Use Case | Cloud Instance |
|---|---|---|---|
| NVIDIA T4 | 16 GB | Inference, light training | AWS g4dn |
| NVIDIA A10G | 24 GB | Inference, fine-tuning | AWS g5 |
| NVIDIA A100 | 40/80 GB | Training, heavy inference | AWS p4d |
| NVIDIA H100 | 80 GB | Large model training | AWS p5 |
| NVIDIA L4 | 24 GB | Inference, video | AWS g6 |
| NVIDIA L40S | 48 GB | Mixed training/inference | Various |
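Given the table above, a rough rule of thumb for whether a model's weights fit on a card is parameters × bytes per parameter, plus overhead for activations, KV cache, and CUDA context. A minimal sketch — the 20% overhead factor is an illustrative assumption, not a fixed constant:

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM (GB) needed to hold model weights for inference.

    bytes_per_param: 4 for fp32, 2 for fp16/bf16, 1 for int8.
    overhead: multiplier for KV cache, activations, CUDA context (assumed 20%).
    """
    return params_billion * bytes_per_param * overhead

# Llama 2 7B in fp16: fits on a 24 GB A10G, not on a 16 GB T4
print(round(estimate_vram_gb(7, 2), 1))   # 16.8
# 70B in fp16: needs multiple A100/H100 GPUs
print(round(estimate_vram_gb(70, 2), 1))  # 168.0
```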
2. CUDA Driver Management¶
The CUDA stack is the most fragile part of GPU operations:
Application (PyTorch, TensorFlow, vLLM)
│
▼
CUDA Toolkit (nvcc, libraries: cuBLAS, cuDNN)
│
▼
CUDA Driver (kernel module: nvidia.ko)
│
▼
GPU Hardware (NVIDIA A100, H100, etc.)
Version compatibility is STRICT:
├── CUDA Toolkit version must match the application's build
├── CUDA Driver must be >= the toolkit version
├── Driver must support the GPU hardware generation
└── Kernel version must be compatible with the driver
# Check current NVIDIA driver and CUDA versions
nvidia-smi # Driver version, GPU status
nvcc --version # CUDA toolkit version
# Common driver installation (Ubuntu)
# Option 1: Package manager (recommended for servers)
apt install nvidia-driver-535 # Specific version
apt install nvidia-headless-535-server # No X11 (servers)
# Option 2: NVIDIA's runfile (when you need specific versions)
chmod +x NVIDIA-Linux-x86_64-535.129.03.run
./NVIDIA-Linux-x86_64-535.129.03.run --silent
# DKMS: automatically rebuilds driver on kernel updates
apt install nvidia-dkms-535
# CRITICAL: test driver after kernel updates
# A kernel update can break the NVIDIA module
# Add to your update playbook:
modprobe nvidia
nvidia-smi
# If nvidia-smi fails after kernel update, rebuild DKMS:
dkms autoinstall
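The "driver must be >= toolkit" rule above can be encoded as a simple pre-deployment check. A minimal sketch, assuming versions are plain "major.minor" strings (`nvidia-smi` reports the maximum CUDA version the installed driver supports in its header):

```python
def parse_version(v: str) -> tuple:
    """Turn '12.1' into (12, 1) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))

def driver_supports_toolkit(driver_cuda: str, toolkit_cuda: str) -> bool:
    """The driver's supported CUDA version must be >= the toolkit's version."""
    return parse_version(driver_cuda) >= parse_version(toolkit_cuda)

# A driver supporting CUDA 12.2 can run apps built against toolkit 12.1...
print(driver_supports_toolkit("12.2", "12.1"))  # True
# ...but not against toolkit 12.4
print(driver_supports_toolkit("12.2", "12.4"))  # False
```

Numeric tuple comparison matters here: string comparison would wrongly rank "12.10" below "12.9".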
3. GPU Scheduling in Kubernetes¶
Kubernetes doesn't natively understand GPUs. You need the NVIDIA device plugin:
# Install NVIDIA device plugin (DaemonSet)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPUs are visible to Kubernetes
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
# nvidia.com/gpu: 4 (allocatable)
# nvidia.com/gpu: 4 (capacity)
# Request GPUs in a pod spec
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 2        # Request 2 GPUs
        memory: "64Gi"
        cpu: "16"
      requests:
        nvidia.com/gpu: 2
        memory: "64Gi"
        cpu: "16"
    volumeMounts:
    - name: model-storage
      mountPath: /models
    - name: shm                  # CRITICAL for PyTorch DataLoader
      mountPath: /dev/shm
  volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: model-pvc
  - name: shm                    # Shared memory for data loading
    emptyDir:
      medium: Memory
      sizeLimit: "16Gi"
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
GPU scheduling is all-or-nothing by default: if a pod requests 1 GPU, it gets exclusive access to that entire GPU. No sharing, no overcommit.
Gotcha: Unlike CPU and memory, GPU requests and limits must be equal. You cannot request 0.5 GPUs or set `requests: 1, limits: 2`. GPUs are non-compressible, non-divisible integer resources in Kubernetes; the NVIDIA device plugin advertises whole GPUs only.
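Because GPUs are exclusively allocated and the nodes are expensive, a common companion to the device plugin is tainting GPU nodes so CPU-only pods stay off them. A minimal sketch — the `nvidia.com/gpu` taint key is a common convention, adapt to your cluster's:

```yaml
# Taint GPU nodes (run once per node):
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# GPU workloads then carry a matching toleration in their pod spec:
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```

CPU-only pods without the toleration are repelled from the tainted nodes, keeping the node's CPU and RAM free to feed the GPUs.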
4. GPU Time-Slicing and MIG¶
Sharing GPUs between workloads:
Time-Slicing (temporal sharing):
# NVIDIA device plugin config for time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # Each physical GPU appears as 4 virtual GPUs
MIG (Multi-Instance GPU) — hardware partitioning on A100/H100:
# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1
# Create GPU instances (A100 80GB example)
# Split into 7 instances of ~10GB each:
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C
# Or 2 instances of ~40GB + 1 instance of ~10GB
# (a GPU has at most 7 compute slices; each 3g instance uses 3):
nvidia-smi mig -i 0 -cgi 9,9,19 -C
# List MIG instances
nvidia-smi mig -lgi
nvidia-smi mig -lci
# Each MIG instance is a separate GPU resource in Kubernetes
# nvidia.com/mig-1g.10gb, nvidia.com/mig-2g.20gb, etc.
Name origin: MIG stands for Multi-Instance GPU. It was introduced with the NVIDIA A100 (Ampere architecture, 2020). The "instances" are hardware-level partitions with dedicated memory, cache, and compute units — not time-sliced virtualizations. Think of it as physically splitting one GPU into smaller independent GPUs.
MIG provides true memory isolation — one workload can't OOM another. Ideal for inference serving where each model needs a predictable slice of GPU memory.
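Once MIG instances exist and the device plugin is configured for them, pods request a slice instead of a whole GPU. A minimal sketch — the exact resource name depends on the MIG strategy configured in the device plugin, and the image is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
  - name: server
    image: registry.example.com/inference:latest   # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1   # one hardware-isolated ~10GB MIG slice
```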
5. Model Serving Infrastructure¶
Deploying trained models as API endpoints:
┌─────────────────┐
│ Load Balancer │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
┌────────▼──┐ ┌───────▼───┐ ┌───────▼───┐
│ vLLM │ │ vLLM │ │ vLLM │
│ replica │ │ replica │ │ replica │
│ (GPU 0) │ │ (GPU 1) │ │ (GPU 2) │
└───────────┘ └───────────┘ └───────────┘
vLLM (for LLM inference):
# Deploy vLLM serving Llama 2 7B
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--max-model-len 4096 \
--port 8000
# Kubernetes deployment for vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm                  # must match the selector above
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-2-7b-chat-hf"
        - "--gpu-memory-utilization"
        - "0.90"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120   # Models take time to load
          periodSeconds: 10
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc   # Cache downloaded models
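To stand in for the load balancer in the diagram above, the usual Kubernetes-native front is a Service selecting the `app: vllm` pods. A minimal sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm            # matches the Deployment's pod label
  ports:
  - port: 80             # cluster-facing port
    targetPort: 8000     # vLLM's serving port
```

Requests are spread across the ready replicas; combined with the readiness probe, a replica still loading its model receives no traffic.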
NVIDIA Triton Inference Server (multi-framework):
# Triton supports TensorRT, ONNX, PyTorch, TensorFlow
# Model repository structure:
models/
├── resnet50/
│ ├── config.pbtxt
│ └── 1/
│ └── model.onnx
└── bert-base/
├── config.pbtxt
└── 1/
└── model.pt
# Run Triton
docker run --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /models:/models \
nvcr.io/nvidia/tritonserver:23.10-py3 \
tritonserver --model-repository=/models
6. Storage for Large Models¶
Models are big. Llama 2 70B is ~140GB of fp16 weights; GPT-scale models run to terabytes:
Storage Hierarchy for ML:
├── Model weights (read-heavy, large files)
│ ├── NFS/NAS: simple, shared across nodes
│ ├── S3/GCS + local cache: scalable, slower first load
│ └── PVC with ReadWriteMany: Kubernetes-native
│
├── Training data (read-heavy, massive)
│ ├── S3/GCS with streaming: don't copy to local
│ ├── Lustre/GPFS: parallel filesystem for HPC
│ └── NFS with SSD cache: good enough for small datasets
│
└── Checkpoints (write-heavy during training)
├── Local NVMe: fastest, lost on pod restart
├── PVC: persists across restarts
└── S3 + periodic sync: durable, slower
# PVC for model storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
  - ReadWriteMany              # Multiple pods read same models
  resources:
    requests:
      storage: 500Gi           # Models are large
  storageClassName: nfs-client # NFS for shared access
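The "S3 + periodic sync" checkpoint option above can be as simple as a CronJob running `aws s3 sync` against the checkpoint volume. A minimal sketch — the bucket, claim name, and schedule are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: checkpoint-sync
spec:
  schedule: "*/30 * * * *"     # every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: sync
            image: amazon/aws-cli:latest
            args: ["s3", "sync", "/checkpoints", "s3://my-ml-checkpoints/"]  # placeholder bucket
            volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints
          volumes:
          - name: checkpoints
            persistentVolumeClaim:
              claimName: checkpoint-pvc   # placeholder claim
```

This trades some durability lag (up to one sync interval of lost checkpoints) for fast local NVMe writes during training.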
7. Monitoring GPU Workloads¶
# nvidia-smi is your primary GPU monitoring tool
nvidia-smi # Snapshot
nvidia-smi -l 1 # Refresh every 1 second
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 5
# DCGM (Data Center GPU Manager) for Prometheus
# Install DCGM exporter as a DaemonSet
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring
# Key Prometheus metrics from DCGM:
# DCGM_FI_DEV_GPU_UTIL — GPU compute utilization %
# DCGM_FI_DEV_FB_USED — GPU memory used (MB)
# DCGM_FI_DEV_FB_FREE — GPU memory free (MB)
# DCGM_FI_DEV_GPU_TEMP — GPU temperature
# DCGM_FI_DEV_POWER_USAGE — Power draw (watts)
# DCGM_FI_DEV_SM_CLOCK — SM clock speed (throttling?)
# DCGM_FI_DEV_XID_ERRORS — GPU hardware errors (XID)
Key alerts for GPU clusters:
| Alert | Condition | Severity |
|---|---|---|
| GPU memory near limit | Used > 90% of VRAM | Warning |
| GPU temperature high | > 83°C sustained | Warning |
| GPU utilization low | < 10% for 30min (wasting money) | Info |
| XID errors detected | Any XID error count > 0 | Critical |
| CUDA OOM | Pod restart with OOMKilled | High |
| GPU not detected | nvidia-smi fails | Critical |
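The first rows of the table map directly onto the DCGM metrics listed above. A minimal sketch of Prometheus alerting rules — thresholds and `for` durations are assumptions to adapt:

```yaml
groups:
- name: gpu-alerts
  rules:
  - alert: GPUMemoryNearLimit
    expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.90
    for: 5m
    labels:
      severity: warning
  - alert: GPUTemperatureHigh
    expr: DCGM_FI_DEV_GPU_TEMP > 83
    for: 10m
    labels:
      severity: warning
  - alert: GPUXidErrors
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    labels:
      severity: critical
```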
8. Common GPU OOM Patterns¶
GPU memory (VRAM) OOM is different from system RAM OOM:
GPU OOM causes:
├── Batch size too large (most common)
│ Fix: reduce batch_size in training config
│
├── Model doesn't fit in GPU memory
│ Fix: use model parallelism, quantization, or a bigger GPU
│
├── Memory fragmentation (PyTorch)
│ Fix: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
│
├── Memory leak in training loop
│ Fix: ensure tensors are detached/deleted after use
│
└── Multiple users sharing GPU via time-slicing
Fix: use MIG for memory isolation, or set per-user memory limits
# Monitor GPU memory in real-time during a job
watch -n 1 nvidia-smi
# Check if OOM killed a pod
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Look for: "OOMKilled" in termination reason
# Check dmesg for CUDA/NVIDIA errors
dmesg | grep -i -E "nvrm|nvidia|cuda|xid"
Common Pitfalls¶
Debug clue: When `nvidia-smi` shows GPU utilization at 0% but the training job claims to be running, the most common cause is a CUDA version mismatch: the PyTorch build was compiled against a different CUDA version than the driver supports. Check with `python -c "import torch; print(torch.cuda.is_available())"` inside the container.
- Not mounting /dev/shm properly. PyTorch DataLoader uses shared memory for multiprocess data loading, and the default `/dev/shm` in Kubernetes is 64MB. A training job with 8 data workers will crash. Mount an emptyDir with `medium: Memory` at `/dev/shm`.
- Forgetting that model loading takes minutes. A 70B parameter model takes 2-5 minutes to load into GPU memory. Set `initialDelaySeconds` on readiness probes accordingly. Scale-up is not instant: preload models during off-peak.
- Ignoring GPU temperature under sustained load. Training jobs run for days. GPUs thermal-throttle above 83°C, silently reducing performance by 20-40%. Monitor temperature and ensure adequate cooling.
- Scheduling GPU and non-GPU pods on the same node without taints. Without taints, Kubernetes will schedule CPU-only pods on your expensive GPU nodes, consuming CPU and memory that GPU workloads need for data loading. Taint GPU nodes and tolerate only GPU workloads.
- Not caching downloaded models. Every pod restart re-downloads a 140GB model from HuggingFace. Use a PVC to cache model files. One download, many pod restarts.
- Treating GPU nodes like CPU nodes for resource planning. A GPU node with 4x A100s and 64 CPU cores might only run 4 pods (one per GPU). The CPU and RAM are there to feed the GPUs, not to run additional workloads.
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)
Related Content¶
- AI Tools for DevOps (Topic Pack, L1) — AI Tools for DevOps
- AI-Assisted DevOps Cookbook (Reference, L1) — AI Tools for DevOps
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
- Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core