Core Principles

  1. Measure before you scale. Always take a short, single‑GPU baseline and record simple metrics (a quick check is sketched just after this list).
  2. Right‑size, don’t over‑ask. Request only the GPUs/CPUs/RAM and walltime your measurements justify.
  3. Keep GPUs busy. If utilization is low, fix input/data issues before adding more GPUs.
  4. Short interactive, long batch. Use Open OnDemand (OOD) for quick experiments; move long work to SLURM batch jobs.
  5. Be a good citizen. Release idle sessions, clean up scratch, and prefer storage patterns that reduce system load.
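
  A quick way to take the baseline reading from principles 1 and 3 is sketched below; it assumes NVIDIA GPUs with nvidia-smi on the PATH, and running the same nvidia-smi query directly in a terminal works just as well.

    # One-shot check of GPU utilization and memory.
    import subprocess

    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, util, used, total = (v.strip() for v in line.split(","))
        print(f"GPU {idx}: {util}% utilized, {used}/{total} MiB memory")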

Right‑Sizing in 5 Steps

  1. Baseline (≤5 minutes): Run a tiny slice on 1 GPU. Note:
    • Throughput (samples/s or tokens/s)
    • GPU utilization and memory usage
    • Any stalls from CPU or I/O
  2. Find the knee: Increase batch size and enable mixed precision if supported. Stop when throughput stops improving or memory is exhausted; the first sketch after this list walks through steps 1–2.
  3. Unblock the pipeline: If GPUs are idle, tune the data path (fewer tiny files, local scratch, sensible parallelism), as in the second sketch after this list. Re‑measure.
  4. Decide to scale: Add GPUs only if the job speeds up near linearly. If efficiency drops sharply, stay smaller.
  5. Set requests: Convert measured step/epoch time into walltime with a modest buffer (third sketch after this list). Request CPUs/RAM in proportion to actual need.
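
  First sketch (steps 1–2): a minimal example, assuming a PyTorch‑style training loop on an NVIDIA GPU. The tiny model and random batch are placeholders for your own workload; only the timing pattern and the stop‑at‑the‑knee loop are the point.

    # Measure single-GPU throughput while growing the batch size with mixed
    # precision; stop at the knee (flat throughput or out of memory).
    import time
    import torch

    def measure(batch_size, steps=30, warmup=5):
        device = torch.device("cuda")
        torch.cuda.reset_peak_memory_stats(device)
        # Toy model and data standing in for your real workload.
        model = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
        ).to(device).train()
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)
        scaler = torch.cuda.amp.GradScaler()
        x = torch.randn(batch_size, 1024, device=device)
        y = torch.randint(0, 10, (batch_size,), device=device)
        start = None
        for i in range(steps):
            if i == warmup:                      # discard warmup iterations
                torch.cuda.synchronize()
                start = time.time()
            opt.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():      # mixed precision (step 2)
                loss = torch.nn.functional.cross_entropy(model(x), y)
            scaler.scale(loss).backward()
            scaler.step(opt)
            scaler.update()
        torch.cuda.synchronize()
        samples_per_s = batch_size * (steps - warmup) / (time.time() - start)
        peak_gb = torch.cuda.max_memory_allocated(device) / 1e9
        return samples_per_s, peak_gb

    for bs in (16, 32, 64, 128, 256):
        try:
            tput, mem = measure(bs)
            print(f"batch={bs:4d}  {tput:9.1f} samples/s  peak {mem:.1f} GB")
        except torch.cuda.OutOfMemoryError:
            print(f"batch={bs:4d}  out of memory -- knee reached")
            break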
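
  Second sketch (step 3): an illustrative input‑pipeline tune‑up, assuming the PyTorch DataLoader. The TensorDataset stands in for your own dataset, and the worker/prefetch values are starting points to adjust, not universal recommendations.

    # If the GPU idles between steps, feed it faster before asking for more GPUs.
    import os
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(10_000, 1024),
                            torch.randint(0, 10, (10_000,)))   # placeholder data

    loader = DataLoader(
        dataset,
        batch_size=128,
        shuffle=True,
        num_workers=min(8, os.cpu_count() or 1),  # match the CPUs you request
        pin_memory=True,                          # faster host-to-GPU copies
        persistent_workers=True,                  # keep workers alive across epochs
        prefetch_factor=4,                        # batches queued per worker
    )

    for x, y in loader:                           # re-measure throughput after tuning
        pass

  Staging hot data on local scratch and packing many tiny files into a few larger shards (see Monitoring & Hygiene below) often helps as much as the loader settings.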
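
  Third sketch (step 5): the walltime arithmetic. Every number below is a placeholder to replace with your own measurements from the baseline run.

    # Convert measured step time into a walltime request with ~25% headroom.
    measured_step_s = 0.42        # seconds per step, from the baseline run
    steps_per_epoch = 1_500
    epochs = 20
    buffer = 1.25                 # headroom for checkpoints, I/O, and variance

    total_s = measured_step_s * steps_per_epoch * epochs * buffer
    hours, rem = divmod(int(total_s), 3600)
    print(f"Request roughly {hours:02d}:{rem // 60:02d}:00 of walltime")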

OOD vs SLURM (When to Use What)

  • Open OnDemand (interactive): Fast prototyping, visualization, short tests. Keep sessions short and purposeful. Stop them when you are not actively using the GPU.
  • SLURM batch (production): Long or repeatable runs, sweeps, or multi‑GPU/multi‑node training. Log metrics and use checkpoints so work can resume if preempted or time‑limited (a minimal checkpoint sketch follows).
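
  A minimal checkpoint/resume sketch, assuming a PyTorch‑style loop. The model, path, and epoch count are placeholders; on a real cluster the checkpoint would normally live on fast scratch storage.

    # Resume from the last checkpoint if one exists, so a preempted or
    # time-limited SLURM job can pick up where it left off.
    import os
    import torch

    CKPT = "checkpoint.pt"                         # placeholder path

    model = torch.nn.Linear(1024, 10)              # stand-in for your model
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    start_epoch = 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start_epoch = state["epoch"] + 1           # continue after the saved epoch

    for epoch in range(start_epoch, 20):
        ...                                        # training and metric logging here
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "epoch": epoch}, CKPT)         # overwrite every epoch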

Scaling & Efficiency (Rules of Thumb)

  • Start single‑GPU. Only scale once your single‑GPU run is clearly compute‑bound.
  • Scale gradually. Test 1→2→4→8 GPUs. Keep going only if speedup per added GPU remains healthy (see the efficiency sketch after this list).
  • Stop when it flattens. If added GPUs don’t meaningfully reduce time‑to‑result, you’ve hit bottlenecks (I/O, synchronization, model limits).
  • Small, concurrent jobs: Consider partitioning or sharing large GPUs (e.g., MIG or MPS where available) for many lightweight inference tasks.
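
  A quick way to judge "healthy" versus "flattening", shown with made‑up epoch times; substitute the measurements from your own 1→2→4→8 test.

    # Speedup and per-GPU efficiency from measured seconds per epoch.
    measurements = {1: 620.0, 2: 330.0, 4: 185.0, 8: 150.0}   # GPUs -> s/epoch

    base = measurements[1]
    for gpus, seconds in sorted(measurements.items()):
        speedup = base / seconds
        efficiency = speedup / gpus
        print(f"{gpus} GPU(s): {speedup:4.1f}x speedup, {efficiency:4.0%} efficiency")
    # A common rule of thumb is to keep scaling while efficiency stays above
    # roughly 70-80%; in this made-up example, 8 GPUs would not be worth it.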

Monitoring & Hygiene

  • During runs: Periodically check utilization, memory, and throughput trend. Watch for drift or stalls.
  • After runs: Capture a simple summary (config + metrics; a sketch follows this list). Keep what helps you compare future runs; delete bulky, unused artifacts.
  • Storage: Use scratch/local fast storage for hot data and checkpoints. Avoid floods of tiny files; favor chunked/sharded formats.
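
  A sketch of the "simple summary" habit; the field names and values are illustrative, not a required schema.

    # Write a small JSON record next to your results so runs are easy to compare.
    import json
    import time

    summary = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "config": {"gpus": 1, "batch_size": 128, "mixed_precision": True},
        "metrics": {"samples_per_s": 2150.0, "peak_mem_gb": 18.3, "gpu_util_pct": 92},
        "notes": "baseline before scaling test",
    }
    with open("run_summary.json", "w") as f:
        json.dump(summary, f, indent=2)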

Resource Etiquette

  • Right‑size by evidence. Measurements justify requests—nothing more, nothing less.
  • Release promptly. End idle OOD sessions and cancel stuck jobs.
  • Share context. When asking for help, include a short description of the task, resources requested, and your key measurements.

Quick Decision Guide

  • New idea? → OOD short session → measure.
  • Throughput improving with batch/precision? → keep tuning until the knee.
  • GPU still idle? → fix data/I/O before scaling.
  • Near‑linear speedup when adding GPUs? → scale; else, stay smaller.
  • Runs >1 hour or needs a queue slot? → SLURM batch with checkpoints.

What to Report When Seeking Help

  • What you tried (one sentence) and the goal.
  • Your baseline measurements (throughput, GPU util/memory) and any obvious bottleneck.
  • What you requested (GPUs/CPUs/RAM/time) and whether the job finished or stalled.