Cloud GPUs with the same model name (e.g., "H100 SXM") can perform differently across providers due to thermal throttling, PCIe vs SXM, shared networking, or oversubscription. Always benchmark before committing to a long training run. This guide gives you the exact commands to measure real performance in 15 minutes.
## Benchmark 1: Memory Bandwidth (Most Important for LLMs)
LLM inference is memory-bandwidth-bound. An H100 SXM has 3.35 TB/s theoretical bandwidth. If you're seeing less than 2.8 TB/s, the instance may be throttled:
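To see why bandwidth dominates, here is a back-of-envelope sketch (the function name and the batch-size-1, no-speculation assumptions are mine): each decoded token must stream every weight from VRAM once, so bandwidth divided by model size bounds single-stream decode speed.

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound LLM.
# Each generated token reads all weights from VRAM once, so:
#   tokens/sec <= memory_bandwidth / model_size_in_bytes
# (batch size 1, no speculative decoding; batching raises throughput)
def decode_ceiling_tokens_per_sec(bandwidth_tbs: float,
                                  params_billion: float,
                                  bytes_per_param: float = 2.0) -> float:
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tbs * 1e12 / model_bytes

# An 8B model in FP16 on an H100 SXM (3.35 TB/s):
print(f"{decode_ceiling_tokens_per_sec(3.35, 8):.0f} tokens/sec ceiling")
```

If your measured bandwidth comes in 20% low, your single-stream decode ceiling drops by the same 20%, which is why this benchmark comes first.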
```bash
# Install dependencies
pip install torch

# Quick memory bandwidth test (PyTorch)
python3 - << 'EOF'
import torch, time

# Allocate two 2 GiB FP16 tensors (1024^3 elements x 2 bytes each)
a = torch.randn(1024, 1024, 1024, dtype=torch.float16, device='cuda')
b = torch.empty_like(a)

# Warm up
for _ in range(3):
    b.copy_(a)
torch.cuda.synchronize()

# Benchmark 10 iterations
N = 10
start = time.perf_counter()
for _ in range(N):
    b.copy_(a)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

bytes_transferred = a.nbytes * 2 * N  # each copy reads a and writes b
bandwidth_tbs = bytes_transferred / elapsed / 1e12
print(f"Memory bandwidth: {bandwidth_tbs:.2f} TB/s")

# H100 SXM expected: ~2.8-3.3 TB/s
# A100 SXM expected: ~1.8-2.0 TB/s
# RTX 4090 expected: ~0.85-0.95 TB/s
EOF
```

## Benchmark 2: Compute Throughput (Training Speed)
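The benchmark below derives TFLOPS by counting 2·M·N·K floating-point operations per matmul (one multiply and one add per inner-product term). A tiny helper (the function name is mine) shows the arithmetic you can reuse on any GEMM timing:

```python
# Convert a timed GEMM into TFLOPS. A single M x K by K x N matmul
# performs 2 * M * N * K FLOPs: each of the M*N output elements needs
# K multiplies and K adds.
def gemm_tflops(M: int, N: int, K: int, seconds: float, iters: int = 1) -> float:
    return 2 * M * N * K * iters / seconds / 1e12

# e.g. 100 iterations of a 4096^3 FP16 matmul finishing in 0.05 s:
print(f"{gemm_tflops(4096, 4096, 4096, 0.05, iters=100):.0f} TFLOPS")
```

Dividing the result by the GPU's datasheet peak gives a utilization figure you can compare across providers.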
```bash
python3 - << 'EOF'
import torch, time

# Matrix multiply benchmark (measures tensor core TFLOPS)
M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, dtype=torch.float16, device='cuda')
b = torch.randn(K, N, dtype=torch.float16, device='cuda')

# Warm up
for _ in range(10):
    c = torch.mm(a, b)
torch.cuda.synchronize()

# Benchmark
N_iter = 100
start = time.perf_counter()
for _ in range(N_iter):
    c = torch.mm(a, b)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops = 2 * M * N * K * N_iter  # 2*M*N*K FLOPs per matmul
tflops = flops / elapsed / 1e12
print(f"Matrix multiply: {tflops:.1f} TFLOPS (FP16)")

# H100 SXM expected: ~250-280 TFLOPS (FP16 tensor core)
# A100 SXM expected: ~140-160 TFLOPS
# RTX 4090 expected: ~80-90 TFLOPS
EOF
```

## Benchmark 3: LLM Inference Throughput (Real-World)
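Two of the metrics this benchmark reports are worth defining up front: TTFT (time to first token, dominated by prefill) and TPOT (time per output token during decode). A quick latency-budget sketch (the helper name and the sample numbers are illustrative, not measured values):

```python
# End-to-end latency for one request: the first token arrives after TTFT,
# and each subsequent token adds one TPOT.
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    return ttft_ms + (output_tokens - 1) * tpot_ms

# e.g. 200 ms TTFT and 10 ms TPOT for a 256-token completion:
print(f"{e2e_latency_ms(200, 10, 256):.0f} ms")
```

For chat workloads, TPOT is usually what users feel; for batch workloads, aggregate tokens/sec matters more.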
```bash
# Install vLLM
pip install vllm

# Start an OpenAI-compatible server (gated model: requires Hugging Face access)
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000 &
sleep 60  # wait for weights to download and load; first run can take longer

# Run the serving benchmark script from the vLLM repo against the server
# (flag names vary across vLLM versions -- check the script's --help)
git clone https://github.com/vllm-project/vllm
python vllm/benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 512 \
  --random-output-len 256 \
  --num-prompts 200

# Reports: throughput (tokens/sec), TTFT, TPOT
# H100 SXM expected: ~8,000-12,000 tokens/sec (8B model)
# A100 SXM expected: ~4,000-6,000 tokens/sec
# RTX 4090 expected: ~2,000-3,000 tokens/sec
```

## Benchmark 4: GPU-to-GPU Bandwidth (Multi-GPU Training)
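nccl-tests prints both algbw (message size divided by time) and busbw. For all-reduce, busbw scales algbw by 2(n−1)/n to reflect the traffic each hardware link actually carries, which makes the number comparable across GPU counts. A sketch of the conversion (the function name is mine; the formula follows the nccl-tests performance notes):

```python
# Convert an all-reduce timing into NCCL-style bus bandwidth.
# algbw = bytes / seconds; busbw = algbw * 2*(n-1)/n for an n-GPU
# ring all-reduce, since each element crosses the ring 2*(n-1) times
# spread over n links.
def allreduce_busbw(size_bytes: int, seconds: float, n_gpus: int) -> float:
    algbw = size_bytes / seconds
    return algbw * 2 * (n_gpus - 1) / n_gpus

# e.g. a 256 MiB all-reduce across 2 GPUs completing in 1 ms:
gbps = allreduce_busbw(256 * 2**20, 1e-3, 2) / 1e9
print(f"{gbps:.0f} GB/s busbw")
```

With 2 GPUs the factor is exactly 1, so algbw and busbw coincide; at 8 GPUs busbw is 1.75x algbw.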
```bash
# Install NCCL tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make

# Run all-reduce test across 2 GPUs (critical for multi-GPU training)
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

# Key metric: "busbw" (bus bandwidth)
# H100 NVLink: ~450 GB/s busbw
# A100 NVLink: ~300 GB/s busbw
# PCIe only: ~30-50 GB/s busbw (much slower for training)
```

## Quick GPU Health Check
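Beyond the commands below, the currently active throttle reasons are exposed as a hex bitmask via `nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv,noheader`. A small decoder for that mask, assuming the NVML throttle-reason bit layout (verify the bit values against your driver's `nvml.h` before trusting it):

```python
# Decode nvidia-smi's clocks_throttle_reasons.active hex bitmask.
# Bit assignments below follow NVML's nvmlClocksThrottleReasons flags
# (assumed layout -- confirm against your driver headers).
REASONS = {
    0x01: "GPU idle",
    0x02: "Applications clocks setting",
    0x04: "SW power cap",
    0x08: "HW slowdown",
    0x10: "Sync boost",
    0x20: "SW thermal slowdown",
    0x40: "HW thermal slowdown",
    0x80: "HW power brake slowdown",
}

def decode_throttle(mask_hex: str) -> list[str]:
    mask = int(mask_hex, 16)
    return [name for bit, name in REASONS.items() if mask & bit]

print(decode_throttle("0x0000000000000044"))
```

Anything beyond "GPU idle" showing up while your benchmark is running (power caps, thermal slowdowns) is exactly the kind of provider-side throttling this guide is meant to catch.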
```bash
# Check thermals and clocks (power/temp, clocks, violations, PCIe throughput)
nvidia-smi dmon -s pcvt -d 1 -c 5

# Check for ECC errors (bad VRAM = silent corruption)
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader,nounits

# Enable persistence mode (should be on for consistent performance)
sudo nvidia-smi -pm 1

# Get current clock speeds
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory --format=csv
```

## Reference Numbers
| GPU | Mem BW (TB/s) | Peak FP16 TFLOPS | Llama 8B tok/s |
|---|---|---|---|
| H100 SXM | 3.35 | 989 | 8,000–12,000 |
| H200 SXM | 4.8 | 989 | 14,000+ |
| A100 80GB SXM | 2.0 | 312 | 4,000–6,000 |
| RTX 4090 | 1.0 | 165 | 2,000–3,000 |
| L40S | 0.86 | 362 | 1,500–2,500 |