Cloud GPUs with the same model name (e.g., "H100 SXM") can perform differently across providers due to thermal throttling, PCIe vs SXM, shared networking, or oversubscription. Always benchmark before committing to a long training run. This guide gives you the exact commands to measure real performance in 15 minutes.
## Benchmark 1: Memory Bandwidth (Most Important for LLMs)
LLM inference is memory-bandwidth-bound. An H100 SXM has 3.35 TB/s theoretical bandwidth. If you're seeing less than 2.8 TB/s, the instance may be throttled:
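To see why bandwidth dominates, here is a back-of-envelope sketch (the function name and the batch-size-1, no-speculation assumptions are mine): each decoded token must stream every weight from VRAM once, so bandwidth divided by model size bounds single-stream decode speed.

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound LLM.
# Each generated token reads all weights from VRAM once, so:
#   tokens/sec <= memory_bandwidth / model_size_in_bytes
# (batch size 1, no speculative decoding; batching raises throughput)
def decode_ceiling_tokens_per_sec(bandwidth_tbs: float,
                                  params_billion: float,
                                  bytes_per_param: float = 2.0) -> float:
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tbs * 1e12 / model_bytes

# An 8B model in FP16 on an H100 SXM (3.35 TB/s):
print(f"{decode_ceiling_tokens_per_sec(3.35, 8):.0f} tokens/sec ceiling")
```

If your measured bandwidth comes in 20% low, your single-stream decode ceiling drops by the same 20%, which is why this benchmark comes first.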
```bash
# Install dependencies
pip install torch

# Quick memory bandwidth test (PyTorch)
python3 - << 'EOF'
import torch, time

# Allocate two 2 GiB FP16 tensors (1024^3 elements x 2 bytes each)
a = torch.randn(1024, 1024, 1024, dtype=torch.float16, device='cuda')
b = torch.empty_like(a)

# Warm up
for _ in range(3):
    b.copy_(a)
torch.cuda.synchronize()

# Benchmark 10 iterations
N = 10
start = time.perf_counter()
for _ in range(N):
    b.copy_(a)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

bytes_transferred = a.nbytes * 2 * N  # each copy reads a and writes b
bandwidth_tbs = bytes_transferred / elapsed / 1e12
print(f"Memory bandwidth: {bandwidth_tbs:.2f} TB/s")

# H100 SXM expected: ~2.8-3.3 TB/s
# A100 SXM expected: ~1.8-2.0 TB/s
# RTX 4090 expected: ~0.85-0.95 TB/s
EOF
```

## Benchmark 2: Compute Throughput (Training Speed)
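The benchmark below derives TFLOPS by counting 2·M·N·K floating-point operations per matmul (one multiply and one add per inner-product term). A tiny helper (the function name is mine) shows the arithmetic you can reuse on any GEMM timing:

```python
# Convert a timed GEMM into TFLOPS. A single M x K by K x N matmul
# performs 2 * M * N * K FLOPs: each of the M*N output elements needs
# K multiplies and K adds.
def gemm_tflops(M: int, N: int, K: int, seconds: float, iters: int = 1) -> float:
    return 2 * M * N * K * iters / seconds / 1e12

# e.g. 100 iterations of a 4096^3 FP16 matmul finishing in 0.05 s:
print(f"{gemm_tflops(4096, 4096, 4096, 0.05, iters=100):.0f} TFLOPS")
```

Dividing the result by the GPU's datasheet peak gives a utilization figure you can compare across providers.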
```bash
python3 - << 'EOF'
import torch, time

# Matrix multiply benchmark (measures tensor core TFLOPS)
M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, dtype=torch.float16, device='cuda')
b = torch.randn(K, N, dtype=torch.float16, device='cuda')

# Warm up
for _ in range(10):
    c = torch.mm(a, b)
torch.cuda.synchronize()

# Benchmark
N_iter = 100
start = time.perf_counter()
for _ in range(N_iter):
    c = torch.mm(a, b)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops = 2 * M * N * K * N_iter  # 2*M*N*K FLOPs per matmul
tflops = flops / elapsed / 1e12
print(f"Matrix multiply: {tflops:.1f} TFLOPS (FP16)")

# H100 SXM expected: ~250-280 TFLOPS (FP16 tensor core)
# A100 SXM expected: ~140-160 TFLOPS
# RTX 4090 expected: ~80-90 TFLOPS
EOF
```

## Benchmark 3: LLM Inference Throughput (Real-World)
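Two of the metrics this benchmark reports are worth defining up front: TTFT (time to first token, dominated by prefill) and TPOT (time per output token during decode). A quick latency-budget sketch (the helper name and the sample numbers are illustrative, not measured values):

```python
# End-to-end latency for one request: the first token arrives after TTFT,
# and each subsequent token adds one TPOT.
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    return ttft_ms + (output_tokens - 1) * tpot_ms

# e.g. 200 ms TTFT and 10 ms TPOT for a 256-token completion:
print(f"{e2e_latency_ms(200, 10, 256):.0f} ms")
```

For chat workloads, TPOT is usually what users feel; for batch workloads, aggregate tokens/sec matters more.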
```bash
# Install vLLM
pip install vllm

# Start an OpenAI-compatible server (gated model: requires Hugging Face access)
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000 &
sleep 60  # wait for weights to download and load; first run can take longer

# Run the serving benchmark script from the vLLM repo against the server
# (flag names vary across vLLM versions -- check the script's --help)
git clone https://github.com/vllm-project/vllm
python vllm/benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 512 \
  --random-output-len 256 \
  --num-prompts 200

# Reports: throughput (tokens/sec), TTFT, TPOT
# H100 SXM expected: ~8,000-12,000 tokens/sec (8B model)
# A100 SXM expected: ~4,000-6,000 tokens/sec
# RTX 4090 expected: ~2,000-3,000 tokens/sec
```

## Benchmark 4: GPU-to-GPU Bandwidth (Multi-GPU Training)
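nccl-tests prints both algbw (message size divided by time) and busbw. For all-reduce, busbw scales algbw by 2(n−1)/n to reflect the traffic each hardware link actually carries, which makes the number comparable across GPU counts. A sketch of the conversion (the function name is mine; the formula follows the nccl-tests performance notes):

```python
# Convert an all-reduce timing into NCCL-style bus bandwidth.
# algbw = bytes / seconds; busbw = algbw * 2*(n-1)/n for an n-GPU
# ring all-reduce, since each element crosses the ring 2*(n-1) times
# spread over n links.
def allreduce_busbw(size_bytes: int, seconds: float, n_gpus: int) -> float:
    algbw = size_bytes / seconds
    return algbw * 2 * (n_gpus - 1) / n_gpus

# e.g. a 256 MiB all-reduce across 2 GPUs completing in 1 ms:
gbps = allreduce_busbw(256 * 2**20, 1e-3, 2) / 1e9
print(f"{gbps:.0f} GB/s busbw")
```

With 2 GPUs the factor is exactly 1, so algbw and busbw coincide; at 8 GPUs busbw is 1.75x algbw.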
```bash
# Install NCCL tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make

# Run all-reduce test across 2 GPUs (critical for multi-GPU training)
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

# Key metric: "busbw" (bus bandwidth)
# H100 NVLink: ~450 GB/s busbw
# A100 NVLink: ~300 GB/s busbw
# PCIe only: ~30-50 GB/s busbw (much slower for training)
```

## Quick GPU Health Check
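Beyond the commands below, the currently active throttle reasons are exposed as a hex bitmask via `nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv,noheader`. A small decoder for that mask, assuming the NVML throttle-reason bit layout (verify the bit values against your driver's `nvml.h` before trusting it):

```python
# Decode nvidia-smi's clocks_throttle_reasons.active hex bitmask.
# Bit assignments below follow NVML's nvmlClocksThrottleReasons flags
# (assumed layout -- confirm against your driver headers).
REASONS = {
    0x01: "GPU idle",
    0x02: "Applications clocks setting",
    0x04: "SW power cap",
    0x08: "HW slowdown",
    0x10: "Sync boost",
    0x20: "SW thermal slowdown",
    0x40: "HW thermal slowdown",
    0x80: "HW power brake slowdown",
}

def decode_throttle(mask_hex: str) -> list[str]:
    mask = int(mask_hex, 16)
    return [name for bit, name in REASONS.items() if mask & bit]

print(decode_throttle("0x0000000000000044"))
```

Anything beyond "GPU idle" showing up while your benchmark is running (power caps, thermal slowdowns) is exactly the kind of provider-side throttling this guide is meant to catch.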
```bash
# Check thermals and clocks (power/temp, clocks, violations, PCIe throughput)
nvidia-smi dmon -s pcvt -d 1 -c 5

# Check for ECC errors (bad VRAM = silent corruption)
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader,nounits

# Enable persistence mode (should be on for consistent performance)
sudo nvidia-smi -pm 1

# Get current clock speeds
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory --format=csv
```

## Reference Numbers
| GPU | Mem BW (TB/s) | Peak FP16 TFLOPS | Llama 8B tok/s |
|---|---|---|---|
| H100 SXM | 3.35 | 989 | 8,000–12,000 |
| H200 SXM | 4.8 | 989 | 14,000+ |
| A100 80GB SXM | 2.0 | 312 | 4,000–6,000 |
| RTX 4090 | 1.0 | 165 | 2,000–3,000 |
| L40S | 0.86 | 362 | 1,500–2,500 |