Quiz: AI/ML Infrastructure Ops
3 questions
L1 (1 question)
1. Why is mounting /dev/shm as an emptyDir with medium: Memory critical for PyTorch training jobs in Kubernetes?
Show answer
PyTorch's DataLoader uses shared memory (/dev/shm) to pass tensors between worker processes. The default /dev/shm in a Kubernetes container is only 64 MB, so a training job with multiple data-loading workers will crash once it exceeds that limit. Mounting an emptyDir with medium: Memory at /dev/shm provides adequate shared memory backed by RAM.

L2 (1 question)
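A minimal pod-spec fragment for the shared-memory mount described in the previous answer (the volume name dshm, the image, and the 8Gi sizeLimit are illustrative choices, not requirements):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-train        # illustrative name
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch   # any PyTorch image
    volumeMounts:
    - name: dshm             # overrides the 64 MB default /dev/shm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory         # RAM-backed tmpfs
      sizeLimit: 8Gi         # usage counts against the pod's memory limit
```

Note that with medium: Memory, data written to the volume is charged against the container's memory limit, so size it alongside the job's memory request.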
1. What is the difference between GPU time-slicing and MIG (Multi-Instance GPU), and when would you use each?
Show answer
Time-slicing shares a GPU by rapidly switching between workloads but provides no memory isolation — all workloads share GPU memory, creating OOM risk. MIG (available on A100/H100-class GPUs) partitions the GPU into hardware-isolated instances, each with dedicated memory and compute. Use time-slicing for development environments where cost matters more than isolation. Use MIG for production inference where each model needs predictable, isolated GPU memory.

L3 (1 question)
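As a sketch of the MIG side of the previous answer: once the NVIDIA device plugin advertises MIG resources, a pod can request a hardware-isolated slice by name. The exact resource name depends on how the cluster partitions its GPUs; nvidia.com/mig-1g.5gb below assumes an A100 split with the 1g.5gb profile, and the pod/image names are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server              # illustrative name
spec:
  containers:
  - name: model
    image: registry.example.com/model # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1      # one isolated MIG instance, not the whole GPU
```

With time-slicing, workloads instead request plain nvidia.com/gpu and the device plugin oversubscribes the physical GPU behind that resource, with no memory fencing between pods.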
1. A GPU training job shows low GPU utilization (15%) despite requesting a full A100. What are three likely causes and how do you diagnose each?
Show answer
1. Data loading bottleneck — the CPU/disk can't feed data fast enough. Diagnose with nvidia-smi showing the GPU idle between bursts, and check CPU utilization and disk I/O. Fix with more DataLoader workers or faster storage.
2. Small batch size — the GPU finishes each batch quickly and waits. Diagnose by correlating batch size with utilization. Fix by increasing batch size.
3. PCIe bottleneck — data transfer between CPU and GPU is the bottleneck. Diagnose with nvidia-smi showing low PCIe throughput. Fix by using NVLink for multi-GPU or ensuring PCIe Gen4 bandwidth.
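A framework-free sketch of how to confirm cause 1 from inside the training loop: split wall time between waiting on the data iterator and doing compute (profile_loop and slow_loader are illustrative helpers, not part of any library; the sleeps stand in for real I/O and GPU work):

```python
import time

def profile_loop(batches, compute):
    """Split wall time between waiting on the data iterator and computing."""
    data_time = compute_time = 0.0
    t0 = time.perf_counter()
    for batch in batches:              # time spent blocked here is "data loading"
        t1 = time.perf_counter()
        data_time += t1 - t0
        compute(batch)                 # time spent here is "GPU work"
        t0 = time.perf_counter()
        compute_time += t0 - t1
    return data_time, compute_time

def slow_loader(n, delay=0.02):
    """Simulates a loader that can't keep up (stand-in for a DataLoader)."""
    for i in range(n):
        time.sleep(delay)              # disk/CPU-bound batch preparation
        yield i

data_t, compute_t = profile_loop(slow_loader(20), lambda b: time.sleep(0.002))
if data_t > 4 * compute_t:
    print("data-loading bound: add workers or use faster storage")
```

If the data share dominates, more DataLoader workers or faster storage is the usual fix; if compute dominates instead, look at batch size or CPU-to-GPU transfer bandwidth.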