The Uncomfortable Truth About Multi-GPU
Most teams scale to multi-GPU too early. The instinct is understandable — your model doesn't fit on one GPU, so you rent eight. But multi-GPU training introduces communication overhead, synchronization bottlenecks, pipeline bubbles, and failure-recovery complexity that can easily eat 20-40% of your theoretical throughput. Before you rent 8x H100s at $15-68/hr, make sure you've exhausted single-GPU optimizations.
8x H100 Pricing Across Providers
| Provider | 8x H100 $/hr | Monthly Cost (720 hr) | Interconnect |
|---|---|---|---|
| Cudo Compute | ~$14.96 | $10,771 | NVLink |
| Lambda Labs | ~$19.92 | $14,342 | NVLink |
| AWS (p5.48xlarge) | ~$67.68 | $48,730 | NVSwitch + EFA |
The gap between the cheapest (~$14.96/hr) and the most expensive (~$67.68/hr) is roughly 4.5x for the same eight H100s (AWS does bundle NVSwitch and EFA networking, which matters for multi-node jobs but not for a single node). At scale, this choice determines whether a month of training costs roughly $10K or $50K.
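The arithmetic behind the table is simple enough to sanity-check yourself. A sketch, assuming the 720-hour (30-day) month the table uses; the provider names and rates are the ones quoted above:

```python
HOURS_PER_MONTH = 720  # 30-day month, matching the table

# Quoted 8x H100 hourly rates in USD
rates = {
    "Cudo Compute": 14.96,
    "Lambda Labs": 19.92,
    "AWS p5.48xlarge": 67.68,
}

# Monthly cost of keeping the node running continuously
monthly = {name: round(rate * HOURS_PER_MONTH) for name, rate in rates.items()}

# Ratio between the most and least expensive provider
spread = rates["AWS p5.48xlarge"] / rates["Cudo Compute"]  # ~4.5x
```

Running this reproduces the table's monthly column ($10,771, $14,342, $48,730) and the ~4.5x spread.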
NVLink vs PCIe: It Actually Matters
NVLink (fourth generation, as on the H100) provides 900 GB/s of bidirectional bandwidth between GPUs. A PCIe Gen5 x16 link provides 128 GB/s. That's a 7x difference. For data parallelism with gradient all-reduce, the interconnect bandwidth directly determines how fast gradients sync across GPUs. For tensor parallelism (splitting a single model across GPUs), it determines the overhead of every forward and backward pass.
In practice, NVLink matters most for: large-batch training with frequent gradient syncs, tensor parallelism across GPUs, and any workload where GPUs need to communicate every few milliseconds. Data parallelism with large batch sizes and infrequent syncs can tolerate PCIe, but you'll still pay a 10-15% throughput penalty.
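To make the bandwidth gap concrete, here is a back-of-envelope sketch of per-step gradient sync time. It assumes a ring all-reduce (each GPU moves roughly 2(N-1)/N of the gradient buffer) and takes half of each bidirectional figure above as per-direction bandwidth; `allreduce_time_ms` and its inputs are illustrative, not measurements:

```python
def allreduce_time_ms(params_billions, n_gpus, bw_gb_per_s, dtype_bytes=2):
    """Estimate ring all-reduce time in milliseconds.

    Each GPU sends and receives ~2*(N-1)/N of the gradient buffer
    at the per-direction link bandwidth. Latency terms ignored.
    """
    grad_bytes = params_billions * 1e9 * dtype_bytes
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic / (bw_gb_per_s * 1e9) * 1e3

# 7B parameters, FP16 gradients, 8 GPUs
nvlink = allreduce_time_ms(7, 8, 450)  # ~450 GB/s per direction (NVLink)
pcie = allreduce_time_ms(7, 8, 64)     # ~64 GB/s per direction (PCIe Gen5 x16)
```

The estimate comes out around 54 ms per sync on NVLink versus roughly 383 ms on PCIe — the same ~7x ratio as the raw bandwidth, paid on every optimizer step.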
When You Actually Need Multi-GPU
- Model doesn't fit in single GPU VRAM: A 70B FP16 model needs ~140 GB — no single H100 can hold it. You need tensor parallelism across 2+ GPUs (or use a B200/MI300X).
- Training throughput is the bottleneck: If a single GPU takes 30 days to train your model, 8 GPUs with good scaling efficiency could reduce that to 4-5 days.
- You've already optimized single-GPU: Mixed precision, gradient accumulation, efficient data loading, compiled model — if you've done all of this and still need more, scale out.
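The "doesn't fit" test in the first bullet is worth doing on paper before renting anything. A sketch using standard mixed-precision AdamW accounting (16-bit weights and gradients plus FP32 moments and master weights, 12 bytes/param of optimizer state); activations and buffers are excluded, so real usage is higher:

```python
def training_mem_gb(params_billions, dtype_bytes=2):
    """Rough training-state memory in GB for mixed-precision AdamW.

    Counts 16-bit weights and gradients plus FP32 optimizer state
    (m, v, and a master weight copy: 4 + 4 + 4 bytes per parameter).
    Activations are workload-dependent and not included.
    """
    weights = params_billions * dtype_bytes
    grads = params_billions * dtype_bytes
    optimizer_state = params_billions * 12
    return weights + grads + optimizer_state

# 70B model: 140 GB of FP16 weights alone, ~1.1 TB of full training state
weights_only = 70 * 2
full_state = training_mem_gb(70)
```

This is why the 70B case is unambiguous: the weights alone (140 GB) exceed any single H100, and the full training state is an order of magnitude beyond that.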
When You Don't Need Multi-GPU
- LoRA/QLoRA fine-tuning: QLoRA quantizes the frozen base model to 4-bit, so even a 70B model plus adapters fits on a single A100 80GB; 16-bit LoRA fits models up to roughly 30B. No need for multi-GPU.
- Models under 30B: A single H100 or A100 80GB handles training with gradient accumulation.
- Inference serving: Scale horizontally with independent GPU instances instead of multi-GPU parallelism.
The general rule: exhaust vertical scaling before going horizontal. Use quantization to fit models in fewer GPUs. Use gradient accumulation to simulate larger batches. Use a B200 (180 GB) or MI300X (192 GB) to avoid tensor parallelism entirely. Multi-GPU is the last resort, not the first.
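The gradient accumulation trick mentioned above works because averaging gradients over micro-batches and then stepping once is numerically the same as one step on the combined batch. A framework-agnostic toy demonstration with plain SGD (in a real framework you would divide each micro-batch loss by the accumulation count and call backward per micro-batch before a single optimizer step):

```python
def sgd_step(w, grad, lr):
    """One vanilla SGD update."""
    return w - lr * grad

def accumulated_step(w, micro_grads, lr):
    """Average gradients over micro-batches, then take one step —
    equivalent to a single step on the combined large batch."""
    g = sum(micro_grads) / len(micro_grads)
    return sgd_step(w, g, lr)

# Same per-example gradients, seen as one big batch vs. 4 micro-batches
grads = [0.2, 0.4, 0.1, 0.3]
big_batch = sgd_step(1.0, sum(grads) / len(grads), lr=0.1)
accumulated = accumulated_step(1.0, grads, lr=0.1)
# big_batch == accumulated: both land at the same weight
```

The two paths produce identical updates, which is why a single GPU with accumulation can stand in for a multi-GPU data-parallel batch — at the cost of wall-clock time, not correctness.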