Here's something the GPU cloud marketing pages won't tell you: for LLM inference, the GPU you need is almost entirely determined by one number — how much VRAM your model requires. Not TFLOPS. Not tensor core count. Not the generation of the GPU. Just VRAM, and secondarily, memory bandwidth. If you understand these two things, you'll stop overpaying for inference by 3–5x, which is what most teams are doing right now.
Why Inference Is Not Training
Training and inference are fundamentally different workloads that stress different parts of the GPU. Training is compute-bound: you're doing massive matrix multiplications across large batches of data, and the GPU's raw TFLOPS throughput determines how fast you go. This is why the H100's FP8 tensor cores make a real difference for training — they deliver 2–3x more compute per second than the A100.
Inference is different. Autoregressive text generation — which is how every LLM from GPT to LLaMA generates output — produces one token at a time. Each token requires loading the entire model's weights from VRAM, doing a relatively small computation, then writing the result back. The bottleneck is not compute; it's how fast you can shuttle data between VRAM and the GPU cores. This makes inference a memory-bandwidth-bound workload, and buying the most expensive GPU is almost always wrong.
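You can turn that observation into a back-of-envelope speed limit: if every generated token has to stream all of the weights out of VRAM once, then tokens per second can't exceed bandwidth divided by model size in bytes. Here's a minimal sketch; the function name is ours, the bandwidth figures are vendor spec-sheet numbers, and real throughput lands below this ceiling because of KV-cache reads and kernel overhead.

```python
# Back-of-envelope decode ceiling for single-request inference:
# each generated token reads all model weights from VRAM once,
# so tokens/sec <= bandwidth / model_size_in_bytes.

def max_tokens_per_sec(params_billions: float, bytes_per_param: float,
                       bandwidth_tb_s: float) -> float:
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# 7B model at FP16 (2 bytes/param) on an A100 (~2 TB/s):
print(round(max_tokens_per_sec(7, 2, 2.0)))  # ~143 tokens/sec ceiling
```

Notice that compute never appears in this formula. That's the whole argument of this section in one line of arithmetic.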
Step 1: Figure Out Your VRAM Requirement
The first and most important question is: how much VRAM do you need to fit your model? If your model doesn't fit in VRAM, it doesn't matter how fast the GPU is — it won't work (or it'll swap to system RAM and be 100x slower). Here are the real numbers:
| Model Size | VRAM (FP16) | Best GPU | Price From |
|---|---|---|---|
| 7B params | ~14GB | RTX 4090 (24GB) | $0.39/hr |
| 13B params | ~26GB | L40S (48GB) | $0.88/hr |
| 30B params | ~60GB | A100 80GB | $0.34/hr |
| 70B params | ~140GB | H200 (141GB) or 2x H100 | $1.84/hr or ~$3.74/hr |
A quick rule of thumb: multiply the parameter count in billions by 2 to get the approximate FP16 VRAM requirement in gigabytes. A 7B model needs roughly 14GB; a 70B model needs roughly 140GB. In practice, add 10–20% on top for the KV cache and activations. If you're running quantized models, INT8 cuts the weight footprint in half and INT4 to roughly a quarter, which opens up cheaper GPU options.
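The rule of thumb translates directly into code. This is an illustrative sketch (the function name, the 20% overhead default, and the precision table are our assumptions, not a library API):

```python
# Rough VRAM estimate: params (billions) x bytes per parameter,
# plus headroom for KV cache and activations. The 20% default is
# an assumption; real overhead depends on context length and batch.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def vram_gb(params_billions: float, precision: str = "fp16",
            overhead: float = 0.2) -> float:
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)

print(vram_gb(7))            # ~16.8 GB -> fits a 24GB RTX 4090
print(vram_gb(70, "int4"))   # ~42 GB  -> fits a single 48GB card
```

The second example shows why quantization changes the shopping list: a 70B model at INT4 drops from two GPUs to one mid-range card.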
Step 2: Maximize Memory Bandwidth Per Dollar
Once you know your VRAM requirement and have a shortlist of GPUs that can fit your model, the next question is: which one gives you the most memory bandwidth per dollar? Memory bandwidth directly determines your token generation speed for single-request inference.
| GPU | VRAM | Bandwidth | Price From |
|---|---|---|---|
| H200 | 141GB HBM3e | 4.8 TB/s | $1.84/hr |
| H100 | 80GB HBM3 | 3.35 TB/s | $1.87/hr |
| A100 80GB | 80GB HBM2e | 2 TB/s | $0.34/hr |
| RTX 4090 | 24GB GDDR6X | 1 TB/s | $0.39/hr |
Look at that table carefully. The RTX 4090 at $0.39/hr gives you 1 TB/s of bandwidth. The A100 80GB at $0.34/hr gives you 2 TB/s. For a 7B model that fits on either card, the A100 generates tokens roughly 2x faster and, at these prices, even costs less per hour, so it's the better pick whenever you can actually get it at that rate. The 4090's case is availability and right-sizing: it's far easier to find, and for a small model you aren't paying for 56GB of VRAM you'll never touch.
The H200 is the inference king. At $1.84/hr, it offers 141GB of HBM3e VRAM (just enough for 70B FP16 weights on a single GPU, though long contexts may still call for light quantization) and 4.8 TB/s of memory bandwidth. That's nearly 2.5x the bandwidth of an A100 at roughly 5x the price. For high-throughput 70B inference, the math works out: one H200 replaces two A100 80GBs and, at 4.8 TB/s versus their 4 TB/s combined, is still faster.
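The "per dollar" comparison is worth doing explicitly. A quick sketch using the table above (the dictionary and ranking logic are ours; remember this metric only applies among cards that can actually fit your model):

```python
# Rank GPUs by memory bandwidth per dollar-hour, using the
# spec-sheet bandwidth and price figures from the table above.
gpus = {
    "H200":      {"bandwidth_tb_s": 4.8,  "price_hr": 1.84},
    "H100":      {"bandwidth_tb_s": 3.35, "price_hr": 1.87},
    "A100 80GB": {"bandwidth_tb_s": 2.0,  "price_hr": 0.34},
    "RTX 4090":  {"bandwidth_tb_s": 1.0,  "price_hr": 0.39},
}

for name, g in sorted(gpus.items(),
                      key=lambda kv: kv[1]["bandwidth_tb_s"] / kv[1]["price_hr"],
                      reverse=True):
    print(f"{name:10s} {g['bandwidth_tb_s'] / g['price_hr']:.2f} TB/s per $/hr")
```

At these prices the A100 80GB wins the ratio by a wide margin (~5.9 TB/s per $/hr), with the H200 and RTX 4090 roughly tied around 2.6, and the H100 last. That's the quantitative version of "the most expensive GPU is almost always wrong."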
Step 3: Single Request vs. Batch Inference
Everything above assumes single-request inference, which is the typical use case for chatbots and real-time applications. But if you're doing batch inference — processing thousands of prompts in parallel — the dynamics change completely.
Batch inference is compute-bound, not memory-bandwidth-bound. When you batch 32 or 64 requests together, you're doing matrix multiplications across all of them simultaneously, and the GPU's raw TFLOPS become the bottleneck. In this regime, the H100's 2–3x compute advantage over the A100 actually matters. If you're running an inference service at high utilization with continuous batching (vLLM or TGI, for example), the H100 can serve more tokens per dollar than an A100 because you're saturating compute, not bandwidth.
But be honest with yourself: are you actually running at high enough utilization to saturate an H100? Most teams aren't. Most inference deployments run at 10–30% GPU utilization. At that level, you're bandwidth-bound regardless of the GPU, and the cheaper option wins.
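You can estimate where that crossover sits with a simplified roofline argument. At FP16, each decode step reads roughly 2 bytes per parameter once, while the FLOPs scale with batch size (about 2 FLOPs per parameter per request), so arithmetic intensity grows linearly with batch size and compute becomes the bottleneck roughly when the batch exceeds peak-FLOPS divided by bandwidth. This is a sketch under those assumptions; it ignores KV-cache traffic, which pushes the real crossover lower.

```python
# Simplified roofline crossover for FP16 decode:
# bytes/step  ~ 2 * params            (weights read once)
# FLOPs/step  ~ 2 * params * batch    (per-request matmuls)
# => arithmetic intensity ~ batch (FLOP/byte), so compute-bound when
#    batch > peak_tflops / bandwidth_tb_s. Spec numbers approximate.

def crossover_batch(peak_tflops: float, bandwidth_tb_s: float) -> float:
    return peak_tflops / bandwidth_tb_s

print(round(crossover_batch(312, 2.0)))  # A100 FP16 (~312 TFLOPS): batch ~156
```

A crossover in the low hundreds is exactly why a deployment at 10–30% utilization never leaves the bandwidth-bound regime.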
The Decision Framework
Here's the cheat sheet. Follow these rules and you'll pick the right GPU 90% of the time:
- 7B model or smaller: RTX 4090 at $0.39/hr. Cheap, fast enough, 24GB is plenty.
- 13B model: L40S at $0.88/hr or A6000 (48GB). Don't overshoot to an A100 unless you need the bandwidth.
- 30B model: A100 80GB at $0.34/hr. Best price-to-VRAM ratio on the market right now.
- 70B model: H200 at $1.84/hr if single-GPU matters. Two H100s at ~$3.74/hr if you need tensor parallelism.
- High-batch production serving: H100 — compute advantage actually kicks in at sustained high utilization.
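The cheat sheet is simple enough to encode directly. A sketch, with GPU names and prices taken from the tables above, thresholds assuming FP16 weights, and the function name ours:

```python
# The decision framework above as a lookup. Thresholds assume FP16
# weights; quantized models can step down a tier.

def pick_gpu(params_billions: float, high_batch: bool = False) -> str:
    if high_batch:
        return "H100 (compute advantage at sustained high utilization)"
    if params_billions <= 7:
        return "RTX 4090 ($0.39/hr)"
    if params_billions <= 13:
        return "L40S ($0.88/hr) or A6000 48GB"
    if params_billions <= 30:
        return "A100 80GB ($0.34/hr)"
    return "H200 ($1.84/hr), or 2x H100 for tensor parallelism"

print(pick_gpu(13))  # L40S ($0.88/hr) or A6000 48GB
```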
Stop overpaying. Use our comparison tool to find the right GPU at the right price for your inference workload.