
How to Benchmark Cloud GPUs: Measure What Matters

Benchmark memory bandwidth, TFLOPS, and inference throughput on any cloud GPU. vLLM and NCCL test scripts.

April 10, 2026 · 10 min read
What to Measure and Why

Metric                  Why it matters                                 Tool
Memory bandwidth        Bottleneck for LLM inference (memory-bound)    bandwidth-test / PyTorch
Compute (TFLOPS)        Bottleneck for training (compute-bound)        cublas-bench / nvbench
Inference throughput    Tokens/sec for your actual model               llm-bench / vLLM
NVLink/PCIe bandwidth   Multi-GPU training efficiency                  nccl-tests

Cloud GPUs with the same model name (e.g., "H100 SXM") can perform differently across providers due to thermal throttling, PCIe vs SXM, shared networking, or oversubscription. Always benchmark before committing to a long training run. This guide gives you the exact commands to measure real performance in 15 minutes.

Benchmark 1: Memory Bandwidth (Most Important for LLMs)

LLM inference is memory-bandwidth-bound. An H100 SXM has 3.35 TB/s theoretical bandwidth. If you're seeing less than 2.8 TB/s, the instance may be throttled:

# Install dependencies
pip install torch

# Quick memory bandwidth test (PyTorch)
python3 - << 'EOF'
import torch, time

# Allocate two 2 GB FP16 tensors (1024^3 elements x 2 bytes each)
a = torch.randn(1024, 1024, 1024, dtype=torch.float16, device='cuda')
b = torch.empty_like(a)

# Warm up
for _ in range(3):
    b.copy_(a)
torch.cuda.synchronize()

# Benchmark 10 iterations
N = 10
start = time.perf_counter()
for _ in range(N):
    b.copy_(a)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

bytes_transferred = a.nbytes * 2 * N  # read + write
bandwidth_tbs = bytes_transferred / elapsed / 1e12
print(f"Memory bandwidth: {bandwidth_tbs:.2f} TB/s")
# H100 SXM expected: ~2.8–3.3 TB/s
# A100 SXM expected: ~1.8–2.0 TB/s
# RTX 4090 expected: ~0.85–0.95 TB/s
EOF
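Because decode is memory-bound, the bandwidth number above translates directly into a per-stream throughput ceiling: every generated token streams the full weight set from HBM. A back-of-the-envelope sketch (batching, KV-cache traffic, and quantization all shift the real number):

```python
# Rough decode-throughput ceiling: tokens/sec ~= bandwidth / bytes of weights,
# since each generated token reads every weight once (KV cache ignored).
def max_tokens_per_sec(bandwidth_tbs: float, params_billion: float,
                       bytes_per_param: int = 2) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tbs * 1e12 / weight_bytes

# H100 SXM (3.35 TB/s) serving an 8B model in FP16:
print(f"{max_tokens_per_sec(3.35, 8):.0f} tok/s per stream")  # ~209
```

Batching is how serving engines like vLLM turn that roughly 200 tok/s single-stream ceiling into thousands of aggregate tokens/sec: the same weight read is amortized across many concurrent requests.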

Benchmark 2: Compute Throughput (Training Speed)

python3 - << 'EOF'
import torch, time

# Matrix multiply benchmark (measures tensor core TFLOPS)
M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, dtype=torch.float16, device='cuda')
b = torch.randn(K, N, dtype=torch.float16, device='cuda')

# Warm up
for _ in range(10):
    c = torch.mm(a, b)
torch.cuda.synchronize()

# Benchmark
N_iter = 100
start = time.perf_counter()
for _ in range(N_iter):
    c = torch.mm(a, b)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops = 2 * M * N * K * N_iter
tflops = flops / elapsed / 1e12
print(f"Matrix multiply: {tflops:.1f} TFLOPS (FP16)")
# H100 SXM expected: ~250-280 TFLOPS (FP16 tensor core)
# A100 SXM expected: ~140-160 TFLOPS
# RTX 4090 expected: ~80-90 TFLOPS
EOF
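To judge whether the measured number is healthy, compare it against the card's peak. The peaks below are vendor-published dense FP16 tensor-core figures (assumptions, not measurements here), and a plain torch.mm typically lands well below 100% of peak:

```python
# Measured TFLOPS as a fraction of peak dense FP16 tensor-core throughput.
# Peak figures are vendor-published specs, listed here as assumptions.
PEAK_FP16_TFLOPS = {"H100 SXM": 989, "A100 SXM": 312, "RTX 4090": 165}

def utilization(measured_tflops: float, gpu: str) -> float:
    return measured_tflops / PEAK_FP16_TFLOPS[gpu]

# e.g. a 265 TFLOPS reading on an H100 SXM:
print(f"{utilization(265, 'H100 SXM'):.0%} of peak")
```

If utilization is drastically lower than what you see on a known-good instance of the same GPU, suspect clock throttling and check thermals (see the health-check section below).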

Benchmark 3: LLM Inference Throughput (Real-World)

# Install vLLM
pip install vllm

# Run the built-in benchmark
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 &

sleep 30  # Wait for server to start

python -m vllm.benchmarks.benchmark_throughput \
  --backend openai \
  --endpoint http://localhost:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --input-len 512 \
  --output-len 256
# Note: benchmark script names and flags vary across vLLM versions;
# check the benchmarks/ directory of your installed release.

# Reports: throughput (tokens/sec), TTFT (time to first token), TPOT (time per output token)
# H100 SXM expected: ~8,000-12,000 tokens/sec (8B model)
# A100 SXM expected: ~4,000-6,000 tokens/sec
# RTX 4090 expected: ~2,000-3,000 tokens/sec
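TTFT and TPOT combine into the latency an individual user experiences, which matters as much as aggregate throughput for interactive workloads. A simple sketch with hypothetical H100-class numbers:

```python
# Per-request latency from the two reported metrics: the first token costs
# TTFT, and each subsequent output token costs TPOT.
def request_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    return ttft_ms + tpot_ms * (output_tokens - 1)

# Hypothetical numbers: 100 ms TTFT, 15 ms/token, 256-token completion:
print(request_latency_ms(100, 15, 256), "ms")  # → 3925 ms
```

When comparing providers, compare at the same concurrency level: pushing batch size up raises aggregate tokens/sec but also raises TPOT for each request.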

Benchmark 4: GPU-to-GPU Bandwidth (Multi-GPU Training)

# Install NCCL tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make

# Run all-reduce test (critical for multi-GPU training)
# -g sets the number of GPUs; adjust it to match your instance
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

# Key metric: "busbw" (bus bandwidth)
# H100 NVLink: ~450 GB/s busbw
# A100 NVLink: ~300 GB/s busbw
# PCIe only:   ~30-50 GB/s busbw (much slower for training)
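nccl-tests reports two numbers per size: algbw (bytes reduced divided by time) and busbw, which normalizes for the collective's traffic pattern so it can be compared directly against NVLink or PCIe link speed. The all-reduce conversion, per the formula documented in nccl-tests' PERFORMANCE.md:

```python
# busbw = algbw * 2 * (n - 1) / n for all-reduce (nccl-tests PERFORMANCE.md).
# busbw approximates the per-GPU bus load, comparable to raw link bandwidth.
def allreduce_busbw(algbw_gbs: float, num_gpus: int) -> float:
    return algbw_gbs * 2 * (num_gpus - 1) / num_gpus

print(allreduce_busbw(250.0, 8), "GB/s")  # → 437.5 GB/s
```

This is why busbw is the number to compare across providers: algbw alone shrinks as you add GPUs even when every link is running at full speed.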

Quick GPU Health Check

# Check thermals and clocks
nvidia-smi dmon -s pcvt -d 1 -c 5

# Check for ECC errors (bad VRAM = silent corruption)
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader,nounits

# Enable persistence mode (keeps the driver loaded for consistent performance)
sudo nvidia-smi -pm 1

# Get current clock speeds
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory --format=csv
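To fold the ECC check into an automated smoke test, parse the csv,noheader,nounits output, which is one "corrected, uncorrected" pair per GPU line. A minimal hypothetical helper (note that some GPUs report "[N/A]" for ECC fields, which this sketch does not handle):

```python
# Parse the output of:
#   nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,\
#              ecc.errors.uncorrected.volatile.total --format=csv,noheader,nounits
# One "corrected, uncorrected" pair per GPU line (hypothetical helper).
def ecc_healthy(smi_output: str) -> bool:
    for line in smi_output.strip().splitlines():
        corrected, uncorrected = (int(field) for field in line.split(","))
        if uncorrected > 0:  # any uncorrected error means unreliable VRAM
            return False
    return True

print(ecc_healthy("0, 0\n0, 0"))  # clean 2-GPU node → True
```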

Reference Numbers

GPU             BW (TB/s)   FP16 TFLOPS   LLaMA 8B t/s
H100 SXM        3.35        989           8,000–12,000
H200 SXM        4.8         989           14,000+
A100 80GB SXM   2.0         312           4,000–6,000
RTX 4090        1.0         165           2,000–3,000
L40S            0.86        362           1,500–2,500

(TFLOPS are peak dense FP16 tensor-core figures; measured matmul throughput will be lower.)
