The Uncomfortable Truth About Multi-GPU
Most teams scale to multi-GPU too early. The instinct is understandable — your model doesn't fit on one GPU, so you rent eight. But multi-GPU training introduces communication overhead, synchronization bottlenecks, pipeline bubbles, and failure-recovery complexity that can easily eat 20-40% of your theoretical throughput. Before you rent 8x H100s at $15-68/hr, make sure you've exhausted single-GPU optimizations.
8x H100 Pricing Across Providers
| Provider | 8x H100 $/hr | Monthly Cost (720 hr) | Interconnect |
|---|---|---|---|
| Cudo Compute | ~$14.96 | $10,771 | NVLink |
| Lambda Labs | ~$19.92 | $14,342 | NVLink |
| AWS (p5.48xlarge) | ~$67.68 | $48,730 | NVSwitch + EFA |
The gap between the cheapest (~$14.96/hr) and the most expensive (~$67.68/hr) is roughly 4.5x for the same eight H100s (AWS does bundle NVSwitch and EFA networking, which matters for multi-node jobs but not for a single node). At scale, this choice determines whether a month of training costs roughly $10K or $50K.
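The arithmetic behind the table is simple enough to sanity-check yourself. A sketch, assuming the 720-hour (30-day) month the table uses; the provider names and rates are the ones quoted above:

```python
HOURS_PER_MONTH = 720  # 30-day month, matching the table

# Quoted 8x H100 hourly rates in USD
rates = {
    "Cudo Compute": 14.96,
    "Lambda Labs": 19.92,
    "AWS p5.48xlarge": 67.68,
}

# Monthly cost of keeping the node running continuously
monthly = {name: round(rate * HOURS_PER_MONTH) for name, rate in rates.items()}

# Ratio between the most and least expensive provider
spread = rates["AWS p5.48xlarge"] / rates["Cudo Compute"]  # ~4.5x
```

Running this reproduces the table's monthly column ($10,771, $14,342, $48,730) and the ~4.5x spread.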
NVLink vs PCIe: It Actually Matters
NVLink (fourth generation, as on the H100) provides 900 GB/s of bidirectional bandwidth between GPUs. A PCIe Gen5 x16 link provides 128 GB/s. That's a 7x difference. For data parallelism with gradient all-reduce, the interconnect bandwidth directly determines how fast gradients sync across GPUs. For tensor parallelism (splitting a single model across GPUs), it determines the overhead of every forward and backward pass.
In practice, NVLink matters most for: large-batch training with frequent gradient syncs, tensor parallelism across GPUs, and any workload where GPUs need to communicate every few milliseconds. Data parallelism with large batch sizes and infrequent syncs can tolerate PCIe, but you'll still pay a 10-15% throughput penalty.
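To make the bandwidth gap concrete, here is a back-of-envelope sketch of per-step gradient sync time. It assumes a ring all-reduce (each GPU moves roughly 2(N-1)/N of the gradient buffer) and takes half of each bidirectional figure above as per-direction bandwidth; `allreduce_time_ms` and its inputs are illustrative, not measurements:

```python
def allreduce_time_ms(params_billions, n_gpus, bw_gb_per_s, dtype_bytes=2):
    """Estimate ring all-reduce time in milliseconds.

    Each GPU sends and receives ~2*(N-1)/N of the gradient buffer
    at the per-direction link bandwidth. Latency terms ignored.
    """
    grad_bytes = params_billions * 1e9 * dtype_bytes
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic / (bw_gb_per_s * 1e9) * 1e3

# 7B parameters, FP16 gradients, 8 GPUs
nvlink = allreduce_time_ms(7, 8, 450)  # ~450 GB/s per direction (NVLink)
pcie = allreduce_time_ms(7, 8, 64)     # ~64 GB/s per direction (PCIe Gen5 x16)
```

The estimate comes out around 54 ms per sync on NVLink versus roughly 383 ms on PCIe — the same ~7x ratio as the raw bandwidth, paid on every optimizer step.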
When You Actually Need Multi-GPU
- Model doesn't fit in single GPU VRAM: A 70B FP16 model needs ~140 GB — no single H100 can hold it. You need tensor parallelism across 2+ GPUs (or use a B200/MI300X).
- Training throughput is the bottleneck: If a single GPU takes 30 days to train your model, 8 GPUs with good scaling efficiency could reduce that to 4-5 days.
- You've already optimized single-GPU: Mixed precision, gradient accumulation, efficient data loading, compiled model — if you've done all of this and still need more, scale out.
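The "doesn't fit" test in the first bullet is worth doing on paper before renting anything. A sketch using standard mixed-precision AdamW accounting (16-bit weights and gradients plus FP32 moments and master weights, 12 bytes/param of optimizer state); activations and buffers are excluded, so real usage is higher:

```python
def training_mem_gb(params_billions, dtype_bytes=2):
    """Rough training-state memory in GB for mixed-precision AdamW.

    Counts 16-bit weights and gradients plus FP32 optimizer state
    (m, v, and a master weight copy: 4 + 4 + 4 bytes per parameter).
    Activations are workload-dependent and not included.
    """
    weights = params_billions * dtype_bytes
    grads = params_billions * dtype_bytes
    optimizer_state = params_billions * 12
    return weights + grads + optimizer_state

# 70B model: 140 GB of FP16 weights alone, ~1.1 TB of full training state
weights_only = 70 * 2
full_state = training_mem_gb(70)
```

This is why the 70B case is unambiguous: the weights alone (140 GB) exceed any single H100, and the full training state is an order of magnitude beyond that.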
When You Don't Need Multi-GPU
- LoRA/QLoRA fine-tuning: QLoRA quantizes the frozen base model to 4-bit, so even a 70B model plus adapters fits on a single A100 80GB; 16-bit LoRA fits models up to roughly 30B. No need for multi-GPU.
- Models under 30B: A single H100 or A100 80GB handles training with gradient accumulation.
- Inference serving: Scale horizontally with independent GPU instances instead of multi-GPU parallelism.
The general rule: exhaust vertical scaling before going horizontal. Use quantization to fit models in fewer GPUs. Use gradient accumulation to simulate larger batches. Use a B200 (180 GB) or MI300X (192 GB) to avoid tensor parallelism entirely. Multi-GPU is the last resort, not the first.
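The gradient accumulation trick mentioned above works because averaging gradients over micro-batches and then stepping once is numerically the same as one step on the combined batch. A framework-agnostic toy demonstration with plain SGD (in a real framework you would divide each micro-batch loss by the accumulation count and call backward per micro-batch before a single optimizer step):

```python
def sgd_step(w, grad, lr):
    """One vanilla SGD update."""
    return w - lr * grad

def accumulated_step(w, micro_grads, lr):
    """Average gradients over micro-batches, then take one step —
    equivalent to a single step on the combined large batch."""
    g = sum(micro_grads) / len(micro_grads)
    return sgd_step(w, g, lr)

# Same per-example gradients, seen as one big batch vs. 4 micro-batches
grads = [0.2, 0.4, 0.1, 0.3]
big_batch = sgd_step(1.0, sum(grads) / len(grads), lr=0.1)
accumulated = accumulated_step(1.0, grads, lr=0.1)
# big_batch == accumulated: both land at the same weight
```

The two paths produce identical updates, which is why a single GPU with accumulation can stand in for a multi-GPU data-parallel batch — at the cost of wall-clock time, not correctness.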