NVIDIA's Blackwell RTX 5090 is here: a $2,000 consumer GPU with 1,800 FP8 TFLOPS and 32GB GDDR7. For context, the H100 SXM has 1,979 FP8 TFLOPS and 80GB HBM3. On paper, a consumer card that costs about as much as 1,000-1,500 hours of H100 cloud time delivers 91% of the H100's theoretical compute. So the question everyone is asking: can you replace a $30,000 datacenter card with a $2,000 gaming GPU?
The short answer is: for inference of models that fit in 32GB, yes. For everything else, no. Let me show you exactly where the line is.
The Spec Comparison
| Spec | RTX 5090 (32GB) | H100 SXM (80GB) | RTX 4090 (24GB) |
|---|---|---|---|
| FP8 TFLOPS | 1,800 (est.) | 1,979 | 330 |
| VRAM | 32GB GDDR7 | 80GB HBM3 | 24GB GDDR6X |
| Memory Bandwidth | 1,792 GB/s | 3,350 GB/s | 1,008 GB/s |
| TDP | 575W | 700W (SXM) | 450W |
| Architecture | Blackwell (consumer) | Hopper | Ada Lovelace |
| NVLink | No | Yes (900 GB/s) | No |
| Cloud Price/hr | $0.70-1.20 (est.) | $1.29-1.87 | $0.39 |
| Buy Price | $1,999 | $25,000-40,000 | $1,599 |
Where the 5090 Wins: Small to Mid Model Inference
For models that fit in 32GB — Llama 3 8B FP16 (16GB), Llama 3 8B quantized (6-8GB), Mistral 7B, SDXL, Flux, and even 13B quantized models — the RTX 5090 is a monster. The 1,800 FP8 TFLOPS paired with GDDR7's 1,792 GB/s bandwidth means inference throughput approaching H100 levels. Early benchmarks suggest 90-100 tok/s on Llama 3 8B FP16 — nearly matching the H100's ~105 tok/s.
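Those benchmark figures are easier to sanity-check with a quick bandwidth roofline: during single-stream decoding, every generated token has to stream the full weight set from VRAM, so memory bandwidth, not TFLOPS, usually sets the ceiling. Here is a minimal back-of-envelope sketch using the bandwidth figures from the table; real throughput lands below these ceilings once KV-cache reads and kernel overheads are counted.

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / bytes read per token.
# Assumes single-stream decoding that streams every weight once per token;
# ignores KV-cache traffic and kernel efficiency, so these are upper bounds only.

def decode_ceiling_tok_s(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    model_gb = params_billion * bytes_per_param  # weight footprint in GB
    return bandwidth_gb_s / model_gb

# Llama 3 8B in FP16 (2 bytes per parameter) -> ~16 GB of weights
for name, bw in [("RTX 5090", 1792), ("H100 SXM", 3350), ("RTX 4090", 1008)]:
    print(f"{name}: <= {decode_ceiling_tok_s(8, 2, bw):.0f} tok/s")
# RTX 5090: <= 112 tok/s, H100 SXM: <= 209 tok/s, RTX 4090: <= 63 tok/s
```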
At an estimated cloud price of $0.70-1.20/hr when providers start offering them, the cost per token should be 30-50% lower than on the H100 for models that fit in VRAM. This is the RTX 4090 story all over again, but with 33% more VRAM and 5.5x the compute.
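To put that 30-50% figure in concrete terms, here is a small cost-per-token sketch built on the hourly prices and throughput numbers quoted above; every input is an estimate, so treat the output as illustrative rather than measured.

```python
# Rough cost per million output tokens = hourly price / tokens per hour * 1e6.
# Prices and throughputs are the estimates quoted in this article, not measurements.

def usd_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1e6

# RTX 5090: $0.70-1.20/hr (est.) at ~90-100 tok/s on Llama 3 8B FP16
print(usd_per_million_tokens(0.70, 100))  # ~$1.94 per 1M tokens, best case
print(usd_per_million_tokens(1.20, 90))   # ~$3.70 per 1M tokens, worst case

# H100 SXM: $1.29-1.87/hr at ~105 tok/s
print(usd_per_million_tokens(1.29, 105))  # ~$3.41 per 1M tokens, best case
print(usd_per_million_tokens(1.87, 105))  # ~$4.95 per 1M tokens, worst case
```

With these particular assumptions the gap works out to roughly 25-45%, in the same ballpark as the 30-50% estimate; the real number depends on where 5090 rental prices settle.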
Where the H100 Still Dominates
The H100's advantages are structural and will not go away:
- 80GB HBM3 vs. 32GB GDDR7: A 70B model in FP16 needs ~140GB VRAM. Even quantized to 4-bit, it needs ~35GB. The 5090 cannot load it. The H100 can. For any model above ~13B parameters in FP16 or ~25B quantized, the H100 (or H200) is the only single-GPU option (a quick fit-check sketch follows this list).
- 3,350 GB/s vs. 1,792 GB/s bandwidth: For memory-bound workloads (long sequence generation, large KV caches), the H100's HBM3 delivers nearly 2x the bandwidth. This means faster token generation at longer contexts.
- NVLink at 900 GB/s: For multi-GPU training and inference, the H100 connects GPU-to-GPU at 900 GB/s. The 5090 has no NVLink — multi-GPU communication goes over PCIe at ~64 GB/s. That is a 14x difference in inter-GPU bandwidth. Multi-GPU 5090 setups for training are effectively useless.
- Training at scale: Large model training requires multi-node communication, large batch sizes, and gradient synchronization. The H100 with InfiniBand in data center configurations is purpose-built for this. The 5090 is not.
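For the VRAM math in the first bullet, here is a minimal fit-check sketch; the 1.2x overhead allowance for KV cache, activations, and framework buffers is an assumed rule of thumb, not a measured figure.

```python
# Rough VRAM check: weight footprint plus a fudge factor for KV cache, activations,
# and framework buffers. The 1.2x overhead is a crude rule of thumb, not a measurement.

def weights_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * bits_per_param / 8  # 1B params at 8 bits = 1 GB

def fits_in_vram(params_billion: float, bits_per_param: float, vram_gb: float, overhead: float = 1.2) -> bool:
    return weights_gb(params_billion, bits_per_param) * overhead <= vram_gb

print(weights_gb(70, 16))        # 140.0 GB of weights: 70B in FP16
print(weights_gb(70, 4))         # 35.0 GB of weights: 70B at 4-bit
print(fits_in_vram(70, 4, 32))   # False: over the 5090's 32GB even before overhead
print(fits_in_vram(70, 4, 80))   # True: fits the H100's 80GB with room for KV cache
print(fits_in_vram(8, 16, 32))   # True: Llama 3 8B FP16 (~16GB) fits easily on the 5090
```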
The Real Comparison: 5090 vs 4090
The more honest comparison is 5090 vs. 4090. Here, the 5090 is a clear generational leap:
- 5.5x the FP8 TFLOPS (1,800 vs. 330)
- 33% more VRAM (32GB vs. 24GB): enough for 13B FP16 models (~26GB of weights) that do not fit on the 4090 at all
- 1.78x the memory bandwidth (1,792 vs. 1,008 GB/s): faster token generation
- GDDR7 vs. GDDR6X: newer memory technology with better efficiency
If you are currently renting RTX 4090s for inference, the 5090 will be a straightforward upgrade the moment cloud providers start offering them. Expect 2-3x better inference throughput per dollar once prices stabilize.
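As a rough sanity check on that per-dollar estimate, here is a sketch using the spec-table numbers; it assumes batched, compute-bound serving where throughput scales roughly with FP8 TFLOPS, which is an assumption rather than a benchmark.

```python
# Throughput-per-dollar ratio of renting a 5090 vs a 4090, assuming batched,
# compute-bound serving where throughput scales roughly with FP8 TFLOPS.
# Prices are the estimates from the spec table, not quotes from any provider.

def per_dollar_gain(speedup: float, new_price_hr: float, old_price_hr: float) -> float:
    return speedup * (old_price_hr / new_price_hr)

fp8_speedup = 1800 / 330  # ~5.5x, from the spec table
print(per_dollar_gain(fp8_speedup, 0.70, 0.39))  # ~3.0x if 5090 rentals land at $0.70/hr
print(per_dollar_gain(fp8_speedup, 1.20, 0.39))  # ~1.8x if they land at $1.20/hr
```

For single-stream, memory-bound generation the relevant scaling factor is closer to the 1.78x bandwidth ratio than to raw TFLOPS, so the per-dollar gain there hinges almost entirely on where pricing lands.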
The Verdict
5090 wins: Inference of models under 25B quantized. Image generation (SDXL, Flux, SD3). Cost-per-token for small/mid models. Budget-conscious inference at scale.
H100 wins: Models over 30B. Multi-GPU training. Long-context inference. Enterprise workloads needing 80GB+ VRAM. Anything requiring NVLink.
Wait and see: Cloud pricing for RTX 5090 instances is not settled yet. Current estimates of $0.70-1.20/hr could change significantly. Check back on our tracker once providers start listing them.
Track 5090 availability: We will add RTX 5090 instances to our GPU price comparison as soon as cloud providers start offering them.