
Multi-GPU Training: When 1 GPU Isn't Enough (And When It Is)

8x H100 nodes start around $15/hr. Before scaling to multi-GPU, make sure you actually need it. We break down the math for every model size.

February 5, 2025 · 12 min read

The Uncomfortable Truth About Multi-GPU

Most teams scale to multi-GPU too early. The instinct is understandable — your model doesn't fit in one GPU, so you rent eight. But multi-GPU training introduces communication overhead, synchronization bottlenecks, pipeline bubbles, and failure recovery complexity that can easily eat 20-40% of your theoretical throughput. Before you rent 8x H100s at $15-68/hr, make sure you've exhausted single-GPU optimizations.

8x H100 Pricing Across Providers

| Provider          | 8x H100 $/hr | Monthly Cost | Interconnect   |
|-------------------|--------------|--------------|----------------|
| Cudo Compute      | ~$14.96      | $10,771      | NVLink         |
| Lambda Labs       | ~$19.92      | $14,342      | NVLink         |
| AWS (p5.48xlarge) | ~$67.68      | $48,730      | NVSwitch + EFA |

The gap between cheapest ($14.96/hr) and most expensive ($67.68/hr) is 4.5x for the exact same hardware. At scale, this choice determines whether your training run costs $10K or $50K.
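The arithmetic behind those monthly figures is worth making explicit (they assume roughly 720 billable hours, i.e. a 30-day month of continuous use). A quick sketch, using the hourly rates from the table:

```python
# Rough cost comparison for 8x H100 nodes; hourly rates taken from the table above.
HOURS_PER_MONTH = 720  # 30-day month, running continuously

rates = {
    "Cudo Compute": 14.96,
    "Lambda Labs": 19.92,
    "AWS p5.48xlarge": 67.68,
}

monthly = {name: rate * HOURS_PER_MONTH for name, rate in rates.items()}
for name, cost in monthly.items():
    print(f"{name}: ${cost:,.0f}/month")

gap = rates["AWS p5.48xlarge"] / rates["Cudo Compute"]
print(f"Price gap: {gap:.1f}x for the same 8x H100 hardware")
```

The same hardware, the same training run, and a ~4.5x spread in what you pay for it.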

NVLink vs PCIe: It Actually Matters

NVLink provides 900 GB/s bidirectional bandwidth between GPUs. PCIe Gen5 provides 128 GB/s. That's a 7x difference. For data parallelism with gradient all-reduce, the interconnect bandwidth directly determines how fast gradients sync across GPUs. For tensor parallelism (splitting a single model across GPUs), it determines the overhead of every forward and backward pass.

In practice, NVLink matters most for: large-batch training with frequent gradient syncs, tensor parallelism across GPUs, and any workload where GPUs need to communicate every few milliseconds. Data parallelism with large batch sizes and infrequent syncs can tolerate PCIe, but you'll still pay a 10-15% throughput penalty.
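A back-of-envelope model makes the bandwidth gap concrete. In an ideal ring all-reduce over N GPUs, each GPU transfers roughly 2(N-1)/N times the gradient size per sync, so sync time scales inversely with interconnect bandwidth. The sketch below assumes a hypothetical 7B-parameter model in FP16 and ignores latency and overlap, so treat it as a lower bound, not a benchmark:

```python
# Back-of-envelope ring all-reduce time for syncing FP16 gradients.
# Assumptions: hypothetical 7B-parameter model, 8 GPUs, ideal ring all-reduce,
# bandwidths from the text above (NVLink 900 GB/s, PCIe Gen5 128 GB/s).

def allreduce_seconds(params: float, n_gpus: int, bw_gb_s: float) -> float:
    """Ideal ring all-reduce: each GPU moves 2*(N-1)/N * gradient_bytes."""
    grad_bytes = params * 2  # FP16 = 2 bytes per gradient element
    per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return per_gpu_bytes / (bw_gb_s * 1e9)

params = 7e9  # 7B parameters (assumed for illustration)
nvlink = allreduce_seconds(params, 8, 900)  # NVLink
pcie = allreduce_seconds(params, 8, 128)    # PCIe Gen5
print(f"NVLink: {nvlink * 1000:.0f} ms/sync, PCIe: {pcie * 1000:.0f} ms/sync")
```

Roughly 27 ms versus 190 ms per gradient sync. If your GPUs sync every few hundred milliseconds, that difference is the 10-15% throughput penalty mentioned above; if they sync every few milliseconds, PCIe becomes the bottleneck outright.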

When You Actually Need Multi-GPU

  • Model doesn't fit in single GPU VRAM: A 70B FP16 model needs ~140 GB for the weights alone — no single H100 can hold it. You need tensor parallelism across 2+ GPUs (or use a B200/MI300X).
  • Training throughput is the bottleneck: If a single GPU takes 30 days to train your model, 8 GPUs with good scaling efficiency could reduce that to 4-5 days.
  • You've already optimized single-GPU: Mixed precision, gradient accumulation, efficient data loading, compiled model — if you've done all of this and still need more, scale out.
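The throughput point above can be sanity-checked with simple arithmetic. Perfect linear scaling never happens; the 85% efficiency below is an assumed figure for a well-tuned data-parallel setup, not a measurement:

```python
# Wall-clock estimate under data-parallel scaling with imperfect efficiency.
# The 85% scaling efficiency is an assumption for illustration.

def scaled_days(single_gpu_days: float, n_gpus: int, efficiency: float) -> float:
    """Training days on n_gpus, given single-GPU days and scaling efficiency."""
    return single_gpu_days / (n_gpus * efficiency)

days = scaled_days(30, 8, 0.85)
print(f"8 GPUs at 85% efficiency: {days:.1f} days")
```

That lands in the 4-5 day range quoted above. Note that if efficiency drops to 60% (poor interconnect, small batches), you pay for 8 GPUs but only get the speedup of about 5.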

When You Don't Need Multi-GPU

  • LoRA/QLoRA fine-tuning: Fits on a single A100 80GB for models up to 70B. No need for multi-GPU.
  • Models under 30B: A single H100 or A100 80GB handles training with gradient accumulation.
  • Inference serving: Scale horizontally with independent GPU instances instead of multi-GPU parallelism.

The general rule: exhaust vertical scaling before going horizontal. Use quantization to fit models in fewer GPUs. Use gradient accumulation to simulate larger batches. Use a B200 (180 GB) or MI300X (192 GB) to avoid tensor parallelism entirely. Multi-GPU is the last resort, not the first.
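The quantization math from that rule is easy to check per precision. This sketch counts weight memory only — optimizer states and activations during training, or KV cache during inference, add substantially on top:

```python
# Which precisions let a model's WEIGHTS fit on a single GPU?
# Weights only: training optimizer state and activations are not counted.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
GPU_VRAM_GB = {"H100": 80, "B200": 180, "MI300X": 192}

def fits(params_billions: float) -> dict:
    """Map each precision to the GPUs whose VRAM holds the weights."""
    out = {}
    for prec, bytes_per in BYTES_PER_PARAM.items():
        weights_gb = params_billions * bytes_per  # B params * bytes ≈ GB
        out[prec] = [g for g, vram in GPU_VRAM_GB.items() if weights_gb <= vram]
    return out

for prec, gpus in fits(70).items():
    print(f"70B @ {prec}: fits on {gpus or 'no single GPU'}")
```

A 70B model at FP16 (140 GB) needs a B200 or MI300X, but at 4-bit (35 GB) it drops onto a single H100 — which is exactly why quantized fine-tuning sidesteps multi-GPU entirely.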
