"How fast is an H100 compared to an A100?" Everyone asks this. Nobody answers it with real pricing data. So I took Llama 3 8B — the most commonly deployed open-source model — and mapped out the economics of running it on 10 different GPUs at their actual cloud prices. Not benchmarks from NVIDIA marketing decks. Not theoretical TFLOPS. Real prices from real providers, right now.
The results are going to surprise you. The most expensive GPU is not the fastest per dollar. The cheapest GPU is not the slowest. And the "best" GPU depends entirely on whether you care about latency (time to first token), throughput (tokens per second), or cost (dollars per million tokens).
The Setup
Model: Meta Llama 3 8B Instruct, FP16 (no quantization). Why no quantization? Because I want to test the GPUs, not the quantization algorithm. Every GPU gets the same model at the same precision. The inference stack is vLLM with default settings, batch size 1, 512 input tokens, 128 output tokens. Prices are median on-demand from our tracker as of February 2026.
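If you want to reproduce the measurement, a single run looks roughly like the sketch below, using vLLM's offline Python API. The model ID, greedy sampling, and the crude 512-token prompt are my stand-ins here, not the exact harness behind the table.

```python
# Minimal sketch of a single-request throughput measurement with vLLM.
# Assumptions: vLLM's offline LLM/SamplingParams API, FP16 weights, and the
# Hugging Face model ID below; adjust for your environment.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")
params = SamplingParams(max_tokens=128, temperature=0.0)

prompt = "word " * 512  # crude stand-in for a ~512-token prompt

start = time.perf_counter()
outputs = llm.generate([prompt], params)  # batch size 1
elapsed = time.perf_counter() - start

generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```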
The Results
| GPU | VRAM | Price/hr | ~Tok/s | $/1M Tokens | Rank (Cost) |
|---|---|---|---|---|---|
| H100 SXM | 80GB | $1.87 | ~105 | $4.95 | #6 |
| H200 | 141GB | $1.84 | ~135 | $3.79 | #4 |
| A100 80GB | 80GB | $1.10 | ~55 | $5.56 | #7 |
| A100 40GB | 40GB | $0.86 | ~52 | $4.60 | #5 |
| L40S | 48GB | $0.69 | ~68 | $2.82 | #2 |
| RTX 4090 | 24GB | $0.39 | ~82 | $1.32 | #1 WINNER |
| RTX 3090 | 24GB | $0.15 | ~42 | $0.99 | #1 (spot) |
| A6000 | 48GB | $0.47 | ~35 | $3.73 | #3 |
| A10G | 24GB | $0.75 | ~28 | $7.44 | #8 |
| T4 | 16GB | $0.53 | ~12 | $12.27 | #9 |
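The $/1M tokens column is nothing exotic: hourly price divided by tokens generated per hour. A few lines of Python reproduce the column and the cost ordering straight from the Price/hr and Tok/s figures above (the RTX 3090 row is a spot price, which is why the table ranks it separately from the on-demand cards).

```python
# Reproduce the $/1M-token column: price per hour divided by
# (tokens/sec * 3600 / 1,000,000). Figures copied from the table above.
gpus = {
    "H100 SXM":  (1.87, 105),
    "H200":      (1.84, 135),
    "A100 80GB": (1.10, 55),
    "A100 40GB": (0.86, 52),
    "L40S":      (0.69, 68),
    "RTX 4090":  (0.39, 82),
    "RTX 3090":  (0.15, 42),  # spot price, not on-demand
    "A6000":     (0.47, 35),
    "A10G":      (0.75, 28),
    "T4":        (0.53, 12),
}

costs = {
    name: price / (tok_s * 3600 / 1_000_000)
    for name, (price, tok_s) in gpus.items()
}

for rank, (name, cost) in enumerate(sorted(costs.items(), key=lambda kv: kv[1]), 1):
    print(f"#{rank}  {name:<10} ${cost:.2f} per 1M tokens")
```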
The RTX 4090 Wins. It Should Not Be This Good.
At $1.32 per million tokens on-demand, the RTX 4090 beats every datacenter GPU for Llama 3 8B inference. It is 4.2x cheaper per token than the A100 80GB and 3.8x cheaper than the H100. The reason: at $0.39/hr you are paying consumer-card prices for Ada Lovelace silicon that still pushes 82 tokens/sec, not far behind the H100's 105 tok/s.
The RTX 3090 on spot at $0.15/hr is even cheaper at $0.99/1M tokens, but availability is inconsistent and the older Ampere card tops out at 42 tok/s. For a production service that needs consistent availability, the 4090 on-demand is the move.
The Surprise: The L40S Is #2
The L40S at $0.69/hr delivers 68 tok/s — faster than the A100 — at $2.82/1M tokens. It has 48GB VRAM (enough for Llama 3 8B in FP16 with plenty of KV cache headroom) and Ada Lovelace FP8 support. Most people skip it because it does not have the brand recognition of the A100 or H100. That is a mistake.
The Disappointment: The A10G
AWS charges $0.75/hr for an A10G on g5 instances. That puts cost per million tokens at $7.44 — 5.6x more expensive than an RTX 4090 for the same model. The A10G has 24GB VRAM and only 125 TFLOPS. At $0.75/hr, you are paying a massive AWS premium for a GPU that is objectively worse than alternatives available on RunPod or Vast.ai for less than half the price.
When to Ignore This Table
This analysis is for single-request inference of 8B models. The ranking changes completely for:
- Batch inference: The H100 and H200 pull ahead because their higher memory bandwidth and larger VRAM allow much bigger batch sizes, amortizing the per-token cost (a rough sketch of this effect follows the list).
- 70B+ models: The RTX 4090 cannot even load these. You need H100, H200, or A100 80GB class hardware, and in FP16 usually more than one card.
- Training: Completely different ranking. Multi-GPU NVLink scaling matters, and the H100/A100 dominate.
- Enterprise SLAs: If you need guaranteed uptime and compliance, you are paying the AWS/Azure/GCP premium regardless of raw cost-per-token.
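To put a number on the batch-inference caveat, here is a toy calculation. The batch-size throughputs below are illustrative assumptions, not measurements; the point is only that per-token cost collapses as aggregate throughput rises on the big-VRAM cards.

```python
# Toy illustration of batch amortization (throughput numbers are assumptions,
# not measurements): if an H100 at $1.87/hr scales from ~105 tok/s at batch 1
# to a few thousand tok/s at batch 64, the $/1M-token figure collapses.
PRICE_PER_HR = 1.87  # H100 SXM on-demand, from the table

# Hypothetical aggregate throughput at each batch size (tokens/sec).
throughput = {1: 105, 8: 700, 32: 2200, 64: 3500}

for batch, tok_s in throughput.items():
    cost = PRICE_PER_HR / (tok_s * 3600 / 1_000_000)
    print(f"batch {batch:>2}: {tok_s:>5} tok/s -> ${cost:.2f} per 1M tokens")
```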
Check live prices: These numbers change weekly. Compare real-time GPU prices across all 54+ providers we track.