Core Principles
- Measure before you scale. Always take a short, single‑GPU baseline and record simple metrics.
- Right‑size, don’t over‑ask. Request only the GPUs/CPUs/RAM and walltime your measurements justify.
- Keep GPUs busy. If utilization is low, fix input/data issues before adding more GPUs; a quick way to check utilization is sketched after this list.
- Short interactive, long batch. Use Open OnDemand (OOD) for quick experiments; move long work to SLURM batch jobs.
- Be a good citizen. Release idle sessions, clean up scratch, and prefer storage patterns that reduce system load.
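A quick way to see whether a GPU is actually busy is to poll its utilization and memory while your code runs. The sketch below is a minimal example that assumes the nvidia-ml-py (pynvml) package and samples the first visible GPU once per second; running nvidia-smi on the compute node reports the same numbers interactively.

```python
# Minimal utilization check; assumes the nvidia-ml-py (pynvml) package is installed.
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)        # first visible GPU

for _ in range(10):                               # sample once per second for ~10 s
    util = pynvml.nvmlDeviceGetUtilizationRates(gpu)
    mem = pynvml.nvmlDeviceGetMemoryInfo(gpu)
    print(f"util {util.gpu:3d}%  mem {mem.used / 2**30:5.1f} / {mem.total / 2**30:5.1f} GiB")
    time.sleep(1)

pynvml.nvmlShutdown()
```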
Right‑Sizing in 5 Steps
- Baseline (≤5 minutes): Run a tiny slice on 1 GPU (a timing sketch follows this list). Note:
  - Throughput (samples/s or tokens/s)
  - GPU utilization and memory usage
  - Any stalls from CPU or I/O
- Find the knee: Increase batch size and enable mixed precision if supported. Stop when throughput stops improving or memory is exhausted.
- Unblock the pipeline: If GPUs are idle, tune the data path (fewer tiny files, local scratch, sensible parallelism). Re‑measure.
- Decide to scale: Add GPUs only if the job shows near‑linear speedup. If efficiency drops sharply, stay smaller.
- Set requests: Convert measured step/epoch time into walltime with a modest buffer. Request CPUs/RAM in proportion to actual need.
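As a rough illustration of the baseline, knee, and walltime steps, here is a minimal sketch assuming a PyTorch workload on a single CUDA GPU. The toy model, batch sizes, and sample counts are placeholders; substitute your own model and data path and keep the timing pattern.

```python
# Single-GPU throughput baseline with mixed precision (assumes PyTorch + CUDA).
# The toy model and random data are stand-ins for your own workload.
import time
import torch

device = torch.device("cuda")
model = torch.nn.Sequential(                      # placeholder model; use your own
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).to(device)
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()

def samples_per_second(batch_size, steps=20):
    """Time a short run at this batch size and return throughput in samples/s."""
    x = torch.randn(batch_size, 1024, device=device)        # stand-in for real data
    y = torch.randint(0, 10, (batch_size,), device=device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        with torch.cuda.amp.autocast():                      # mixed precision
            loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    torch.cuda.synchronize()
    return steps * batch_size / (time.perf_counter() - start)

# Grow the batch size until throughput flattens (the knee) or memory runs out.
for bs in (64, 128, 256, 512):
    print(f"batch {bs:4d}: {samples_per_second(bs):,.0f} samples/s")

# Convert measured throughput into a walltime request with a modest buffer.
total_samples = 1_000_000                                    # e.g., dataset size x epochs
hours = total_samples / samples_per_second(256) / 3600
print(f"estimated walltime: {hours * 1.2:.1f} h (20% buffer is an illustrative choice)")
```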
OOD vs SLURM (When to Use What)
- Open OnDemand (interactive): Fast prototyping, visualization, short tests. Keep sessions short and purposeful, and stop them when you are not actively using the GPU.
- SLURM batch (production): Long or repeatable runs, sweeps, or multi‑GPU/multi‑node training. Log metrics and use checkpoints so work can resume if preempted or time‑limited (a minimal checkpoint sketch follows).
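Below is a hedged sketch of the checkpoint/resume pattern mentioned above, assuming a PyTorch training loop; the file name, save interval, and toy model are illustrative, and the checkpoint would normally live on scratch or other fast storage.

```python
# Checkpoint/resume sketch; assumes PyTorch. Path, interval, and model are placeholders.
import os
import torch

CKPT = "checkpoint.pt"                               # ideally on scratch/fast storage
model = torch.nn.Linear(10, 1)                       # toy model; substitute your own
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def load_checkpoint():
    """Resume from the last checkpoint if present; otherwise start at step 0."""
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

start = load_checkpoint()
for step in range(start, 10_000):
    loss = (model(torch.randn(32, 10)) ** 2).mean()  # stand-in for a real training step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:                              # save periodically, not every step
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT)
```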
Scaling & Efficiency (Rules of Thumb)
- Start single‑GPU. Only scale once your single‑GPU run is clearly compute‑bound.
- Scale gradually. Test 1→2→4→8 GPUs. Keep going only if the speedup per added GPU remains healthy (see the sketch after this list).
- Stop when it flattens. If added GPUs don’t meaningfully reduce time‑to‑result, you’ve hit bottlenecks (I/O, synchronization, model limits).
- Small, concurrent jobs: Consider partitioning or sharing large GPUs (e.g., MIG or MPS where available) for many lightweight inference tasks.
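To make "stop when it flattens" concrete, the small sketch below turns measured time‑to‑result at each GPU count into speedup and parallel efficiency; the timings are hypothetical, so plug in your own measurements.

```python
# Scaling check with hypothetical timings; replace them with your own measurements.
measured = {1: 600.0, 2: 315.0, 4: 180.0, 8: 130.0}   # seconds per epoch (example values)

baseline = measured[1]
for gpus, seconds in sorted(measured.items()):
    speedup = baseline / seconds                       # relative to the single-GPU run
    efficiency = speedup / gpus                        # 1.0 means perfectly linear scaling
    print(f"{gpus} GPU(s): {speedup:4.1f}x speedup, {efficiency:4.0%} efficiency")
```

With these illustrative numbers, efficiency falls from roughly 95% at 2 GPUs to under 60% at 8, so stopping at 4 GPUs would be a reasonable call.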
Monitoring & Hygiene
- During runs: Periodically check utilization, memory, and throughput trend. Watch for drift or stalls.
- After runs: Capture a simple summary (config + metrics); one way to do this is sketched after this list. Keep what helps you compare future runs; delete bulky, unused artifacts.
- Storage: Use scratch/local fast storage for hot data and checkpoints. Avoid floods of tiny files; favor chunked/sharded formats.
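As one way to capture the "config + metrics" summary mentioned above, the sketch below appends one JSON line per run; the field names, values, and file name are illustrative, not a required format.

```python
# Append a one-line run summary (config + metrics); all field names and values are illustrative.
import json
import time

summary = {
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
    "config": {"gpus": 1, "batch_size": 256, "mixed_precision": True},               # what you ran
    "metrics": {"samples_per_s": 4200, "peak_gpu_mem_gib": 31.5, "walltime_h": 3.4},  # what you measured
}

with open("run_summary.jsonl", "a") as f:             # one record per line, easy to compare later
    f.write(json.dumps(summary) + "\n")
```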
Resource Etiquette
- Right‑size by evidence. Request what your measurements justify, nothing more and nothing less.
- Release promptly. End idle OOD sessions and cancel stuck jobs.
- Share context. When asking for help, include a short description of the task, resources requested, and your key measurements.
Quick Decision Guide
- New idea? → OOD short session → measure.
- Throughput improving with batch/precision? → keep tuning until the knee.
- GPU still idle? → fix data/I/O before scaling.
- Near‑linear speedup when adding GPUs? → scale; else, stay smaller.
- Runs longer than an hour, or needs a queue slot? → SLURM batch with checkpoints.
What to Report When Seeking Help
- What you tried (one sentence) and the goal.
- Your baseline measurements (throughput, GPU util/memory) and any obvious bottleneck.
- What you requested (GPUs/CPUs/RAM/time) and whether the job finished or stalled.