MLOps¶
10 cards — 🟢 3 easy | 🟡 4 medium | 🔴 3 hard
🟢 Easy (3)¶
1. What are the six key GPU metrics that ops engineers need to monitor?
GPU utilization (%), GPU memory used (GB), GPU temperature (C), GPU power draw (W), PCIe throughput (GB/s), and NVLink throughput (GB/s). Use nvidia-smi for snapshots and the DCGM exporter with Prometheus for continuous monitoring.
Remember: MLOps = DevOps for ML. Training, versioning, deployment, monitoring. "CI/CD for models."
Debug clue: `nvidia-smi dmon -s pucvmet` gives a continuous monitoring stream of all six metrics in one command.
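For continuous monitoring, the DCGM exporter is typically scraped by Prometheus. A minimal scrape-job sketch (the job name and service address are placeholders; 9400 is the exporter's default port):

```yaml
# Illustrative Prometheus scrape job for the DCGM exporter.
scrape_configs:
  - job_name: dcgm-exporter                           # hypothetical job name
    scrape_interval: 15s
    static_configs:
      - targets: ["dcgm-exporter.monitoring:9400"]    # default exporter port
```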
2. What are the four layers of the CUDA stack from top to bottom, and why is version compatibility critical?
Application (PyTorch/TensorFlow) -> CUDA Toolkit (nvcc, cuBLAS, cuDNN) -> CUDA Driver (nvidia.ko kernel module) -> GPU Hardware. Compatibility is strict: toolkit version must match the application build, driver must be >= toolkit version, driver must support the GPU generation, and kernel version must be compatible with the driver.
Analogy: The CUDA stack is like a tower — app at the top, hardware at the bottom. Each layer must be compatible with its neighbors or the whole tower falls.
3. How does GPU scheduling work in Kubernetes by default?
Kubernetes doesn't natively understand GPUs — you need the NVIDIA device plugin (a DaemonSet). GPU scheduling is all-or-nothing: if a pod requests 1 GPU, it gets exclusive access to that entire GPU. No sharing, no overcommit. Verify GPUs with kubectl describe node | grep nvidia.com/gpu.
Gotcha: GPUs cannot be overcommitted like CPU. If you request 1 GPU, you get the whole GPU. Plan node sizing accordingly.
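The all-or-nothing request looks like this in a pod spec (pod name and image are placeholders). Note the GPU is requested under resources.limits as the device plugin's extended resource:

```yaml
# Minimal pod requesting one whole GPU (sketch).
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test                # hypothetical name
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1      # whole GPU, exclusive — fractions are not allowed
```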
🟡 Medium (4)¶
1. What is the difference between GPU time-slicing and MIG (Multi-Instance GPU)?
Time-slicing shares a GPU by rapidly switching between workloads — no memory isolation, all workloads share the same VRAM, high OOM risk. Good for development, bad for production. MIG (available on A100/H100) provides hardware-level partitioning with true memory isolation — one workload can't OOM another. MIG is ideal for inference serving where each model needs a predictable memory slice.
Name origin: MIG = Multi-Instance GPU. Introduced with NVIDIA A100 (Ampere architecture, 2020).
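Time-slicing can be enabled via the NVIDIA device plugin's config — a sketch, assuming the plugin is configured from a ConfigMap (replica count is illustrative). MIG, by contrast, is partitioned on the GPU itself and surfaces as separate extended resources:

```yaml
# Device-plugin config fragment enabling time-slicing (sketch).
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4        # each physical GPU advertised as 4 schedulable slices,
                           # all sharing the same VRAM — no isolation
```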
2. What is vLLM and how do you configure its Kubernetes deployment for production?
vLLM is an LLM inference server. Key deployment considerations: set gpu-memory-utilization (e.g., 0.90), mount a PVC for model cache (avoid re-downloading 140GB models on every restart), set initialDelaySeconds on readiness probes to 120+ seconds (models take minutes to load), mount /dev/shm as emptyDir with medium: Memory for PyTorch DataLoader.
Name origin: vLLM stands for "virtual Large Language Model" — it uses PagedAttention to efficiently manage GPU memory for LLM inference.
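The considerations above can be sketched as a Deployment fragment. Assumptions: the vllm/vllm-openai image with its default port 8000 and /health endpoint, a placeholder model name, and a hypothetical PVC called model-cache-pvc:

```yaml
# Container fragment of a vLLM Deployment (names and paths are illustrative).
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"   # placeholder model
      - "--gpu-memory-utilization=0.90"
    resources:
      limits:
        nvidia.com/gpu: 1
    readinessProbe:
      httpGet: {path: /health, port: 8000}
      initialDelaySeconds: 120                        # model load takes minutes
    volumeMounts:
      - {name: model-cache, mountPath: /root/.cache/huggingface}
      - {name: dshm, mountPath: /dev/shm}
volumes:
  - name: model-cache
    persistentVolumeClaim: {claimName: model-cache-pvc}   # hypothetical PVC
  - name: dshm
    emptyDir: {medium: Memory, sizeLimit: 16Gi}
```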
3. What are the five common causes of GPU memory (VRAM) OOM errors and their fixes?
(1) Batch size too large — reduce batch_size. (2) Model doesn't fit in GPU memory — use model parallelism, quantization, or a bigger GPU. (3) Memory fragmentation in PyTorch — set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. (4) Memory leak in training loop — ensure tensors are detached/deleted. (5) Multiple users sharing a GPU via time-slicing — use MIG for memory isolation.
Debug clue: `nvidia-smi` shows per-GPU memory usage. `torch.cuda.memory_summary()` in Python shows PyTorch's allocation breakdown.
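The fragmentation fix (cause 3) is easiest to apply as a container environment variable — a sketch with a hypothetical container name and image:

```yaml
# Pod spec fragment setting the PyTorch allocator flag (sketch).
containers:
  - name: trainer                         # hypothetical training container
    image: pytorch/pytorch:latest         # illustrative image
    env:
      - name: PYTORCH_CUDA_ALLOC_CONF
        value: "expandable_segments:True" # mitigates fragmentation-driven OOMs
```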
4. What are the three categories of ML storage needs and appropriate solutions for each?
(1) Model weights (read-heavy, large files): NFS/NAS, S3 with local cache, or ReadWriteMany PVCs. (2) Training data (read-heavy, massive): S3/GCS with streaming, Lustre/GPFS for HPC, or NFS with SSD cache. (3) Checkpoints (write-heavy during training): local NVMe for speed, PVC for persistence, or S3 with periodic sync for durability.
Remember: "Models = read-heavy, Training data = massive reads, Checkpoints = write-heavy." Each needs different storage characteristics.
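For case 1, a shared-weights volume is typically a ReadWriteMany PVC — a sketch, assuming an NFS-backed storage class (name and size are placeholders):

```yaml
# ReadWriteMany PVC for shared model weights (sketch).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadWriteMany"]   # many pods read the same weights concurrently
  storageClassName: nfs-client     # hypothetical NFS-backed class
  resources:
    requests:
      storage: 500Gi               # illustrative size
```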
🔴 Hard (3)¶
1. Why is mounting /dev/shm critical for PyTorch training jobs in Kubernetes, and what happens if you forget?
PyTorch DataLoader uses shared memory for multiprocess data loading. Default /dev/shm in Kubernetes is only 64MB. A training job with 8 data workers will crash because it exceeds this limit. Fix: mount an emptyDir with medium: Memory at /dev/shm with an appropriate sizeLimit (e.g., 16Gi).
Gotcha: The default 64MB /dev/shm in Kubernetes is a silent killer for ML workloads. Always mount an emptyDir with `medium: Memory`.
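The fix as a pod spec fragment (container name and image are placeholders):

```yaml
# Replacing the 64MB default /dev/shm with a RAM-backed emptyDir (sketch).
spec:
  containers:
    - name: trainer                # hypothetical training container
      image: pytorch/pytorch:latest
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm      # overrides the tiny default
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory             # tmpfs backed by node RAM
        sizeLimit: 16Gi            # counts against the pod's memory budget
```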
2. Why should GPU nodes be tainted in Kubernetes, and what happens without taints?
Without taints, Kubernetes will schedule CPU-only pods on expensive GPU nodes, consuming CPU and memory that GPU workloads need for data loading. A GPU node with 4x A100s and 64 CPU cores might only run 4 pods (one per GPU), and the CPU/RAM exists to feed the GPUs. Taint GPU nodes and tolerate only GPU workloads to prevent resource waste on $10K-$40K/card hardware.
Remember: "Taint + tolerate = GPU reservation." Without taints, cheap CPU pods consume expensive GPU node resources.
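A sketch of the pairing — the taint key/value shown here are a common convention, not mandated by Kubernetes:

```yaml
# 1. Taint the node, e.g.:
#      kubectl taint nodes <node> nvidia.com/gpu=present:NoSchedule
# 2. Add the matching toleration only to GPU pod specs:
tolerations:
  - key: nvidia.com/gpu      # must match the node taint's key
    operator: Exists
    effect: NoSchedule       # CPU-only pods without this toleration are kept off
```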
3. What are the six critical alerts to configure for a GPU cluster and their severity levels?
GPU memory > 90% VRAM (Warning), GPU temperature > 83C sustained (Warning — thermal throttling reduces performance 20-40%), GPU utilization < 10% for 30 min (Info — wasting money), XID errors detected (Critical — hardware errors), CUDA OOM pod restart (High), GPU not detected / nvidia-smi fails (Critical). Use DCGM exporter Prometheus metrics for alerting.
Remember: "DCGM exporter = GPU Prometheus metrics." It exposes DCGM_FI_DEV_* metrics for Grafana dashboards and alerting.
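Two of the six alerts as a PrometheusRule sketch — assuming the Prometheus Operator CRD and DCGM exporter metric names (verify exact field names against your exporter version):

```yaml
# Temperature and XID alerts from DCGM exporter metrics (sketch).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
    - name: gpu
      rules:
        - alert: GpuTempHighSustained
          expr: DCGM_FI_DEV_GPU_TEMP > 83   # throttling territory
          for: 10m                          # "sustained"
          labels: {severity: warning}
        - alert: GpuXidError
          expr: DCGM_FI_DEV_XID_ERRORS > 0  # hardware error reported
          labels: {severity: critical}
```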