The hourly rate of a GPU tells you almost nothing useful when choosing compute for LLM inference. An H100 at $2.99/hr and an RTX 4090 at $0.34/hr can each be "expensive" or "cheap" depending on how fast they generate tokens. What actually matters is cost per million tokens: how much you pay to generate a specific volume of output.
We calculated cost-per-token for the major cloud GPUs using current prices from GPU Tracker's live feed (March 2026) and standard throughput benchmarks on Llama-class models. The results are counterintuitive.
## The Math: How Cost Per Token Works
For a given GPU running a specific model at batch size 1:

cost per million tokens = hourly price / (tokens per second × 3,600) × 1,000,000
A GPU that costs twice as much per hour but generates three times the tokens is 1.5x cheaper per token. The ratio of price to throughput is what determines value — not the hourly rate alone.
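That relationship is easy to verify in code. A minimal sketch (the prices and throughputs here are illustrative, not from the tables below):

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars to generate one million output tokens on a single GPU."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# The example from the text: twice the hourly price, three times the throughput
baseline = cost_per_million_tokens(1.00, 1000)  # $1/hr at 1,000 tok/s
faster = cost_per_million_tokens(2.00, 3000)    # $2/hr at 3,000 tok/s
print(round(baseline / faster, 2))  # prints 1.5: the pricier GPU is 1.5x cheaper per token
```

Plugging in real figures works the same way: $0.08/hr at 3,500 tok/s comes out to about $0.006 per million tokens.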
## 7B Model Inference: Cost Per Million Tokens
Using throughput benchmarks for Llama 3 8B at float16, batch size 1 (single-request latency — the most common inference scenario). Prices reflect cheapest available spot price from the live feed.
| GPU | Spot Price | ~Tok/sec | Cost/M Tokens | vs GPT-4o |
|---|---|---|---|---|
| A100 80GB | $0.08/hr | 3,500 | $0.006 | 2,500× cheaper |
| RTX 3090 24GB | $0.05/hr | 1,800 | $0.008 | 1,875× cheaper |
| RTX 5090 32GB | $0.13/hr | 3,200 | $0.011 | 1,364× cheaper |
| H200 141GB | $0.33/hr | 7,500 | $0.012 | 1,250× cheaper |
| RTX 4090 24GB | $0.17/hr | 2,500 | $0.019 | 789× cheaper |
| L40S 48GB | $0.26/hr | 2,800 | $0.026 | 577× cheaper |
| H100 80GB | $0.80/hr | 6,000 | $0.037 | 405× cheaper |
| T4 16GB | $0.07/hr | 400 | $0.049 | 306× cheaper |
Throughput estimates for Llama 3 8B fp16, single-request, batch size 1. Actual throughput varies with model size, quantization, batch size, and vLLM/TGI configuration. GPT-4o pricing estimated at $15/M output tokens.
Key insight: The A100 spot at $0.08/hr is the cheapest per token for 7B models, not because it's the fastest GPU, but because it's by far the cheapest in absolute terms while still being fast enough. An RTX 4090 spot at $0.17/hr costs roughly 3x more per token: it is about twice the hourly price and also generates fewer tokens per second. The A100 spot market is the hidden gem of GPU cloud pricing.
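The whole table can be reproduced from the raw (price, throughput) pairs. A sketch using the figures quoted above (the GPT-4o price is the same $15/M estimate as in the footnote):

```python
gpus_7b = {
    # name: (spot $/hr, approx tok/s for Llama 3 8B fp16, batch size 1)
    "A100 80GB": (0.08, 3500),
    "RTX 3090 24GB": (0.05, 1800),
    "RTX 5090 32GB": (0.13, 3200),
    "H200 141GB": (0.33, 7500),
    "RTX 4090 24GB": (0.17, 2500),
    "L40S 48GB": (0.26, 2800),
    "H100 80GB": (0.80, 6000),
    "T4 16GB": (0.07, 400),
}

GPT4O_PER_M = 15.0  # estimated $/M output tokens

# Cost per million tokens for each GPU, cheapest first
rows = sorted(
    ((name, price / (tps * 3600) * 1e6) for name, (price, tps) in gpus_7b.items()),
    key=lambda r: r[1],
)
for name, cpm in rows:
    print(f"{name:15s} ${cpm:.3f}/M tokens  {GPT4O_PER_M / cpm:,.0f}x cheaper than GPT-4o")
```

The sort order it prints matches the table: A100 first, T4 last.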
## 70B Model Inference: What Changes
For 70B models (Llama 3 70B, Qwen 72B, etc.), you need 80GB+ VRAM for quantized inference or 140GB+ for FP8. This eliminates consumer GPUs entirely and changes the cost picture significantly.
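A rough way to sanity-check whether a model fits: weights take roughly parameter count × bytes per parameter, plus headroom for KV cache and activations. The 1.2x overhead multiplier below is an assumption for illustration, not a benchmark:

```python
def vram_needed_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% headroom (assumed) for KV cache/activations."""
    weight_gb = params_b * bits_per_weight / 8  # billions of params x bytes/param = GB
    return weight_gb * overhead

print(round(vram_needed_gb(70, 4), 1))   # 70B at Q4: fits an 80GB A100
print(round(vram_needed_gb(70, 8), 1))   # 70B at Q8: needs the H200's 141GB
print(round(vram_needed_gb(70, 16), 1))  # 70B at fp16: multi-GPU territory
```

This is why the 70B table below contains no consumer cards: even Q4 needs more than any 24-32GB consumer GPU offers.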
| GPU | VRAM | Spot Price | ~Tok/sec (70B Q4) | Cost/M Tokens |
|---|---|---|---|---|
| A100 80GB | 80GB | $0.08/hr | 800 | $0.028 |
| H100 80GB | 80GB | $0.80/hr | 1,800 | $0.123 |
| H200 141GB | 141GB | $0.33/hr | 2,200 | $0.042 |
| B200 180GB | 180GB | $1.67/hr | 4,000 | $0.116 |
For 70B inference, the A100 80GB spot ($0.08/hr) is again the cheapest per token by a large margin. The H100 runs quantized 70B inference about 2.25x faster, but at 10x the hourly price, so each token costs roughly 4.4x more. Unless wall-clock time is the constraint, the A100 wins for cost-sensitive inference at any volume.
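The tradeoff in numbers, using the figures from the 70B table (the 10M-token job size is illustrative):

```python
def job_cost_and_time(price_per_hour: float, tokens_per_sec: float, tokens: int):
    """Total dollars and hours to generate `tokens` output tokens on one GPU."""
    hours = tokens / tokens_per_sec / 3600
    return price_per_hour * hours, hours

tokens = 10_000_000  # a 10M-token batch job (illustrative)
a100_cost, a100_hours = job_cost_and_time(0.08, 800, tokens)   # A100 80GB spot
h100_cost, h100_hours = job_cost_and_time(0.80, 1800, tokens)  # H100 80GB spot

print(f"A100: ${a100_cost:.2f} over {a100_hours:.1f}h")  # cheaper, slower
print(f"H100: ${h100_cost:.2f} over {h100_hours:.1f}h")  # faster, pricier
```

The H100 finishes in under half the time but the bill is about 4.4x larger, and that ratio holds at any job size.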
The H200 spot ($0.33/hr) is the interesting outlier: it runs 70B models in Q8 mode (141GB VRAM accommodates it), giving substantially better quality than Q4. The $0.042/M token rate makes it the best choice when output quality matters as much as cost.
## When the H100 Actually Makes Sense for Inference
The H100 is not cost-optimal for inference in most scenarios. It only makes financial sense when:
- Latency is the constraint, not cost: The H100 generates tokens faster, which reduces time-to-first-token and improves user experience. If you're building a product where latency degrades conversion, the H100 may be worth the premium.
- High-concurrency throughput: At batch sizes of 64 and above, the H100's compute advantage over the A100 grows significantly. For a high-traffic inference endpoint serving thousands of concurrent requests, a single H100 at $2.99/hr on-demand may deliver more aggregate throughput than five A100s at $0.34/hr each, making it cheaper per token despite the higher hourly rate.
- Specialized FP8 workloads: The H100's native FP8 support gives 1.5–2x throughput improvement for models deployed with FP8 quantization. For high-volume production deployments using FP8, the effective cost-per-token narrows significantly.
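The high-concurrency case comes down to aggregate batched throughput per dollar. A toy comparison, where the batched throughput figures are assumptions for illustration, not measured benchmarks:

```python
def cost_per_m(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per million tokens at a given aggregate throughput."""
    return price_per_hour / (tokens_per_sec * 3600) * 1e6

# Assumed aggregate throughput at large batch sizes (illustrative, not measured)
h100_batched = 30_000  # tok/s on one H100 at $2.99/hr on-demand
a100_batched = 3_000   # tok/s per A100 at $0.34/hr on-demand

one_h100 = cost_per_m(2.99, h100_batched)
five_a100 = cost_per_m(5 * 0.34, 5 * a100_batched)
print(f"1x H100:  ${one_h100:.4f}/M tokens")
print(f"5x A100:  ${five_a100:.4f}/M tokens")
```

Under these assumed numbers the single H100 comes out cheaper per token; with weaker batched scaling the five A100s win, which is why this only holds for genuinely high-concurrency endpoints.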
## Recommendations by Workload
| Use Case | Best GPU | Why |
|---|---|---|
| 7B model, cost-first | A100 80GB spot ($0.08/hr) | Cheapest cost-per-token, fast enough for batch |
| 7B model, latency-first | RTX 4090 on-demand ($0.34/hr) | Fast inference, widely available, no interruption |
| 13B model inference | RTX 5090 spot ($0.13/hr) | 32GB VRAM handles Q8, cheapest per token at this size |
| 70B model, cost-first | A100 80GB spot ($0.08/hr) | Q4 70B at $0.028/M tokens is unbeatable |
| 70B model, quality-first | H200 spot ($0.33/hr) | 141GB VRAM enables Q8, competitive cost/token |
| High-concurrency production | H100 on-demand ($1.58–2.99/hr) | Compute wins at large batch sizes |
| Tight budget, max scale | RTX 3090 spot ($0.05/hr) | Cheapest absolute price, handles 7B well |
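The table above can be encoded as a simple lookup. A sketch (the keys are a naming convention invented here; GPU names and prices are taken straight from the table):

```python
# (model size, priority) -> recommended GPU, per the table above
RECOMMENDATIONS = {
    ("7b", "cost"): "A100 80GB spot ($0.08/hr)",
    ("7b", "latency"): "RTX 4090 on-demand ($0.34/hr)",
    ("7b", "budget"): "RTX 3090 spot ($0.05/hr)",
    ("13b", "cost"): "RTX 5090 spot ($0.13/hr)",
    ("70b", "cost"): "A100 80GB spot ($0.08/hr)",
    ("70b", "quality"): "H200 spot ($0.33/hr)",
    ("production", "concurrency"): "H100 on-demand ($1.58-2.99/hr)",
}

def recommend(model_size: str, priority: str) -> str:
    """Return the table's pick, or a fallback for unlisted combinations."""
    return RECOMMENDATIONS.get((model_size, priority), "check GPU Tracker for current prices")

print(recommend("70b", "cost"))  # A100 80GB spot ($0.08/hr)
```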
See current prices for all GPUs at GPU Tracker. Use the workload recipes tool for a guided recommendation.