Portal | Level: L2: Operations | Topics: AI/ML Infrastructure Ops, Kubernetes Core, AI Tools for DevOps | Domain: DevOps & Tooling

The Ops of AI/ML Workloads - Primer

Why This Matters

You don't need to understand how transformers work to keep a GPU cluster running. But you do need to understand GPU scheduling, CUDA driver management, model serving infrastructure, and why a data scientist's "it works on my laptop" model deployment is about to OOM your most expensive nodes. AI/ML workloads are the fastest-growing category of infrastructure spend, and the ops patterns are different enough from traditional web services that your existing Kubernetes muscle memory will mislead you.

GPU hardware is expensive ($10K-$40K per card), GPU time is scarce, and ML engineers have a fundamentally different relationship with infrastructure than web developers. They want big machines, lots of memory, long-running jobs, and they want them now. Your job is to make the GPU cluster reliable, efficient, and not bankrupt the company.

Core Concepts

1. GPU Architecture for Ops People

You don't need to understand matrix multiplication, but you need to understand the hardware hierarchy:

┌─────────────────────────────────────────────────────┐
  GPU Node (bare metal or cloud instance)

  CPU: 64 cores, 256GB RAM
  ├── PCIe bus
  ├── GPU 0 (NVIDIA A100 80GB)
  ├── GPU 1 (NVIDIA A100 80GB)
  ├── GPU 2 (NVIDIA A100 80GB)
  └── GPU 3 (NVIDIA A100 80GB)

  NVLink (GPU-to-GPU high-speed interconnect)
    GPU 0 ←→ GPU 1 ←→ GPU 2 ←→ GPU 3
    600 GB/s (much faster than PCIe)

  Storage: NVMe SSDs + NFS mount for model storage
  Network: 25/100 Gbps (data loading, distributed)
└─────────────────────────────────────────────────────┘

Key metrics you need to monitor:
├── GPU utilization (%)        is the GPU actually computing?
├── GPU memory used (GB)       how close to OOM?
├── GPU temperature (°C)       thermal throttling?
├── GPU power draw (W)         capacity planning
├── PCIe throughput (GB/s)     data feeding the GPU fast enough?
└── NVLink throughput (GB/s)   multi-GPU communication

Common GPU hardware in production:

GPU           VRAM       Use Case                    Cloud Instance
NVIDIA T4     16 GB      Inference, light training   AWS g4dn
NVIDIA A10G   24 GB      Inference, fine-tuning      AWS g5
NVIDIA A100   40/80 GB   Training, heavy inference   AWS p4d
NVIDIA H100   80 GB      Large model training        AWS p5
NVIDIA L4     24 GB      Inference, video            AWS g6
NVIDIA L40S   48 GB      Mixed training/inference    Various

2. CUDA Driver Management

The CUDA stack is the most fragile part of GPU operations:

Application (PyTorch, TensorFlow, vLLM)
CUDA Toolkit (nvcc, libraries: cuBLAS, cuDNN)
CUDA Driver (kernel module: nvidia.ko)
GPU Hardware (NVIDIA A100, H100, etc.)

Version compatibility is STRICT:
├── CUDA Toolkit version must match the application's build
├── The driver's max supported CUDA version (shown by nvidia-smi) must be >= the toolkit version
├── Driver must support the GPU hardware generation
└── Kernel version must be compatible with the driver

# Check current NVIDIA driver and CUDA versions
nvidia-smi                              # Driver version, GPU status
nvcc --version                          # CUDA toolkit version

# Common driver installation (Ubuntu)
# Option 1: Package manager (recommended for servers)
apt install nvidia-driver-535           # Specific version
apt install nvidia-headless-535-server  # No X11 (servers)

# Option 2: NVIDIA's runfile (when you need specific versions)
chmod +x NVIDIA-Linux-x86_64-535.129.03.run
./NVIDIA-Linux-x86_64-535.129.03.run --silent

# DKMS: automatically rebuilds driver on kernel updates
apt install nvidia-dkms-535

# CRITICAL: test driver after kernel updates
# A kernel update can break the NVIDIA module
# Add to your update playbook:
modprobe nvidia
nvidia-smi
# If nvidia-smi fails after kernel update, rebuild DKMS:
dkms autoinstall
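The "driver >= toolkit" rule above can be sketched as a version comparison. This is an illustrative helper (the function name is ours, not an NVIDIA tool); on a real node you would feed it the CUDA version printed in the `nvidia-smi` header and the release reported by `nvcc --version`.

```shell
# Sketch: does the driver's supported CUDA version satisfy the toolkit?
cuda_compat() {
  driver="$1"; toolkit="$2"
  # sort -V orders version strings; compatible iff toolkit is the lower (or equal) one
  lowest=$(printf '%s\n%s\n' "$driver" "$toolkit" | sort -V | head -1)
  if [ "$lowest" = "$toolkit" ]; then echo "ok"; else echo "MISMATCH"; fi
}

cuda_compat 12.2 12.1   # driver supports 12.2, toolkit is 12.1 -> ok
cuda_compat 11.8 12.1   # driver too old for the toolkit -> MISMATCH
```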

3. GPU Scheduling in Kubernetes

Kubernetes doesn't natively understand GPUs. You need the NVIDIA device plugin:

# Install NVIDIA device plugin (DaemonSet)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPUs are visible to Kubernetes
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
#  nvidia.com/gpu:   4    (allocatable)
#  nvidia.com/gpu:   4    (capacity)

# Request GPUs in a pod spec
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
    - name: trainer
      image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/gpu: 2            # Request 2 GPUs
          memory: "64Gi"
          cpu: "16"
        requests:
          nvidia.com/gpu: 2
          memory: "64Gi"
          cpu: "16"
      volumeMounts:
        - name: model-storage
          mountPath: /models
        - name: shm                     # CRITICAL for PyTorch DataLoader
          mountPath: /dev/shm
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: model-pvc
    - name: shm                          # Shared memory for data loading
      emptyDir:
        medium: Memory
        sizeLimit: "16Gi"
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"

GPU scheduling is all-or-nothing by default: if a pod requests 1 GPU, it gets exclusive access to that entire GPU. No sharing, no overcommit.

Gotcha: Unlike CPU and memory, GPU requests and limits must be equal. You cannot request 0.5 GPUs or set requests: 1, limits: 2. GPUs are non-compressible, non-divisible integer resources in Kubernetes. The NVIDIA device plugin advertises whole GPUs only.

4. GPU Time-Slicing and MIG

Sharing GPUs between workloads:

Time-Slicing (temporal sharing):

# NVIDIA device plugin config for time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4                # Each physical GPU appears as 4 virtual GPUs

Time-slicing shares the GPU by rapidly switching between workloads. No memory isolation — all workloads share the same GPU memory. OOM risk is high. Good for development, bad for production inference.

MIG (Multi-Instance GPU) — hardware partitioning on A100/H100:

# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1

# Create GPU instances (A100 80GB example)
# Split into 7 instances of ~10GB each:
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# Or 3 instances of ~20GB + 1 instance of ~10GB
# (profile 14 = 2g.20gb, profile 19 = 1g.10gb on A100 80GB):
nvidia-smi mig -i 0 -cgi 14,14,14,19 -C

# List MIG instances
nvidia-smi mig -lgi
nvidia-smi mig -lci

# Each MIG instance is a separate GPU resource in Kubernetes
# nvidia.com/mig-1g.10gb, nvidia.com/mig-2g.20gb, etc.

Name origin: MIG stands for Multi-Instance GPU. It was introduced with the NVIDIA A100 (Ampere architecture, 2020). The "instances" are hardware-level partitions with dedicated memory, cache, and compute units — not time-sliced virtualizations. Think of it as physically splitting one GPU into smaller independent GPUs.

MIG provides true memory isolation — one workload can't OOM another. Ideal for inference serving where each model needs a predictable slice of GPU memory.
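Not every combination of profiles fits on one card: an A100 exposes 7 compute slices, and each profile's `Ng` prefix is its slice count. A minimal sketch of that budget check (the profile names follow NVIDIA's 1g/2g/3g naming; the function is ours, for illustration):

```python
# Compute-slice budget per A100 GPU (memory slices are a separate, similar budget)
SLICES = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "4g.40gb": 4, "7g.80gb": 7}

def fits(layout):
    """True if the requested MIG layout fits within 7 compute slices."""
    return sum(SLICES[p] for p in layout) <= 7

print(fits(["1g.10gb"] * 7))                       # 7 slices -> True
print(fits(["3g.40gb"] * 3))                       # 9 slices -> False
print(fits(["2g.20gb", "2g.20gb", "2g.20gb", "1g.10gb"]))  # 7 slices -> True
```

This is why `nvidia-smi mig` rejects some layouts outright: the slice budget is hardware-fixed, not a scheduler policy.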

5. Model Serving Infrastructure

Deploying trained models as API endpoints:

                    ┌─────────────────┐
                    │  Load Balancer  │
                    └────────┬────────┘
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼──┐  ┌───────▼───┐  ┌───────▼───┐
     │  vLLM     │  │  vLLM     │  │  vLLM     │
     │  replica  │  │  replica  │  │  replica  │
     │  (GPU 0)  │  │  (GPU 1)  │  │  (GPU 2)  │
     └───────────┘  └───────────┘  └───────────┘

vLLM (for LLM inference):

# Deploy vLLM serving Llama 2 7B
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --port 8000

# Kubernetes deployment for vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm                       # Must match the selector above
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-2-7b-chat-hf"
            - "--gpu-memory-utilization"
            - "0.90"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "32Gi"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120      # Models take time to load
            periodSeconds: 10
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc     # Cache downloaded models

NVIDIA Triton Inference Server (multi-framework):

# Triton supports TensorRT, ONNX, PyTorch, TensorFlow
# Model repository structure:
models/
├── resnet50/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── bert-base/
    ├── config.pbtxt
    └── 1/
        └── model.pt

# Run Triton
docker run --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /models:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models

6. Storage for Large Models

Models are big. LLaMA 2 70B is ~140GB. GPT-scale models are terabytes:

Storage Hierarchy for ML:
├── Model weights (read-heavy, large files)
│   ├── NFS/NAS: simple, shared across nodes
│   ├── S3/GCS + local cache: scalable, slower first load
│   └── PVC with ReadWriteMany: Kubernetes-native
├── Training data (read-heavy, massive)
│   ├── S3/GCS with streaming: don't copy to local
│   ├── Lustre/GPFS: parallel filesystem for HPC
│   └── NFS with SSD cache: good enough for small datasets
└── Checkpoints (write-heavy during training)
    ├── Local NVMe: fastest, lost on pod restart
    ├── PVC: persists across restarts
    └── S3 + periodic sync: durable, slower

# PVC for model storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteMany                     # Multiple pods read same models
  resources:
    requests:
      storage: 500Gi                    # Models are large
  storageClassName: nfs-client          # NFS for shared access

7. Monitoring GPU Workloads

# nvidia-smi is your primary GPU monitoring tool
nvidia-smi                              # Snapshot
nvidia-smi -l 1                         # Refresh every 1 second
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 5

# DCGM (Data Center GPU Manager) for Prometheus
# Install DCGM exporter as a DaemonSet (add the chart repo first)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring

# Key Prometheus metrics from DCGM:
# DCGM_FI_DEV_GPU_UTIL        — GPU compute utilization %
# DCGM_FI_DEV_FB_USED         — GPU memory used (MB)
# DCGM_FI_DEV_FB_FREE         — GPU memory free (MB)
# DCGM_FI_DEV_GPU_TEMP        — GPU temperature
# DCGM_FI_DEV_POWER_USAGE     — Power draw (watts)
# DCGM_FI_DEV_SM_CLOCK        — SM clock speed (throttling?)
# DCGM_FI_DEV_XID_ERRORS      — GPU hardware errors (XID)
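The DCGM metrics above plug straight into Prometheus alerting. A sketch of two rules (alert names and thresholds are illustrative choices, matching the alert table below in spirit):

```yaml
groups:
  - name: gpu-alerts
    rules:
      - alert: GPUMemoryNearLimit
        # Used / (used + free) framebuffer memory above 90% for 5 minutes
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.90
        for: 5m
        labels:
          severity: warning
      - alert: GPUXidErrors
        # Any XID error is a hardware-level fault worth paging on
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels:
          severity: critical
```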

Key alerts for GPU clusters:

Alert                   Condition                          Severity
GPU memory near limit   Used > 90% of VRAM                 Warning
GPU temperature high    > 83°C sustained                   Warning
GPU utilization low     < 10% for 30min (wasting money)    Info
XID errors detected     Any XID error count > 0            Critical
CUDA OOM                Pod restart with OOMKilled         High
GPU not detected        nvidia-smi fails                   Critical

8. Common GPU OOM Patterns

GPU memory (VRAM) OOM is different from system RAM OOM:

GPU OOM causes:
├── Batch size too large (most common)
│   Fix: reduce batch_size in training config
├── Model doesn't fit in GPU memory
│   Fix: use model parallelism, quantization, or a bigger GPU
├── Memory fragmentation (PyTorch)
│   Fix: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
├── Memory leak in training loop
│   Fix: ensure tensors are detached/deleted after use
└── Multiple users sharing GPU via time-slicing
    Fix: use MIG for memory isolation, or set per-user memory limits

# Monitor GPU memory in real-time during a job
watch -n 1 nvidia-smi

# Check if OOM killed a pod
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Look for: "OOMKilled" in termination reason

# Check dmesg for CUDA/NVIDIA errors
dmesg | grep -i -E "nvrm|nvidia|cuda|xid"
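The "batch size too large" fix is usually applied as a retry loop: halve the batch size until the step fits. A minimal framework-agnostic sketch (`run_step` stands in for your framework's forward/backward pass; PyTorch would raise `torch.cuda.OutOfMemoryError` where we use `MemoryError` here):

```python
def find_fitting_batch_size(run_step, batch_size, floor=1):
    """Retry a training step, halving the batch size on each OOM."""
    while batch_size >= floor:
        try:
            run_step(batch_size)
            return batch_size
        except MemoryError:              # stand-in for the framework's OOM error
            batch_size //= 2
    raise RuntimeError(f"model does not fit even at batch size {floor}")

# Simulated GPU that OOMs above batch size 16:
def fake_step(bs):
    if bs > 16:
        raise MemoryError

print(find_fitting_batch_size(fake_step, 128))   # -> 16
```

Note this only papers over the symptom; if the fitting batch size is tiny, you need gradient accumulation, quantization, or a bigger GPU.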

Common Pitfalls

Debug clue: When nvidia-smi shows GPU utilization at 0% but the training job claims to be running, the most common cause is a CUDA version mismatch — the PyTorch build was compiled against a different CUDA version than the driver supports. Check with python -c "import torch; print(torch.cuda.is_available())" inside the container.

  • Not mounting /dev/shm properly. PyTorch DataLoader uses shared memory for multiprocess data loading. Default /dev/shm in Kubernetes is 64MB. A training job with 8 data workers will crash. Mount an emptyDir with medium: Memory at /dev/shm.
  • Forgetting that model loading takes minutes. A 70B parameter model takes 2-5 minutes to load into GPU memory. Set initialDelaySeconds on readiness probes accordingly. Scale-up is not instant — preload models during off-peak.
  • Ignoring GPU temperature under sustained load. Training jobs run for days. GPUs thermal-throttle above 83°C, silently reducing performance by 20-40%. Monitor temperature and ensure adequate cooling.
  • Scheduling GPU and non-GPU pods on the same node without taints. Without taints, Kubernetes will schedule CPU-only pods on your expensive GPU nodes, consuming CPU and memory that GPU workloads need for data loading. Taint GPU nodes and tolerate only GPU workloads.
  • Not caching downloaded models. Every pod restart re-downloads a 140GB model from HuggingFace. Use a PVC to cache model files. One download, many pod restarts.
  • Treating GPU nodes like CPU nodes for resource planning. A GPU node with 4x A100s and 64 CPU cores might only run 4 pods (one per GPU). The CPU and RAM are there to feed the GPUs, not to run additional workloads.
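For the taint pitfall above, the pattern is: taint every GPU node (e.g. `kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule` — node name and key here are illustrative), then give only GPU workloads a matching toleration:

```yaml
# Pod spec fragment: tolerate the GPU-node taint so this pod can land there
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

CPU-only pods lack the toleration, so the scheduler keeps them off the expensive nodes.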
