The Blackwell Generation Has Arrived
NVIDIA's Blackwell architecture is no longer a roadmap slide — it is shipping silicon you can rent right now. The B200, B300, and GB200 represent the most significant generational leap since the Ampere-to-Hopper transition, and they fundamentally change the economics of large-model inference and training. If you are still planning your infrastructure around H100s without understanding what Blackwell offers, you are making decisions on outdated information.
This guide covers every Blackwell SKU available in the cloud today, real pricing from real providers, head-to-head comparisons with H100, and concrete guidance on when upgrading makes financial sense versus when the H100 remains the better pick.
The Blackwell Lineup: B200, B300, and GB200
NVIDIA released three primary Blackwell SKUs targeting different segments of the datacenter market. Understanding the differences is critical because the price gaps are enormous.
| Spec | B200 | B300 | GB200 | H100 (ref) |
|---|---|---|---|---|
| VRAM | 180 GB HBM3e | 262 GB HBM3e | 186 GB HBM3e (dual-die) | 80 GB HBM3 |
| Memory Bandwidth | 8 TB/s | 8 TB/s | 8 TB/s per die | 3.35 TB/s |
| FP8 Performance | 9 PFLOPS (2nd-gen) | 9 PFLOPS (2nd-gen) | 18 PFLOPS (dual-die) | 3.96 PFLOPS |
| FP16 Performance | 4.5 PFLOPS | 4.5 PFLOPS | 9 PFLOPS | 1.98 PFLOPS |
| TDP | 1000W | 1200W | 2700W (full system) | 700W |
| NVLink Bandwidth | 1.8 TB/s | 1.8 TB/s | 1.8 TB/s | 900 GB/s |
The B200 is the workhorse — 180 GB of HBM3e in a single GPU. This is the one most cloud providers are deploying. The B300 pushes VRAM to 262 GB, targeting customers who need to fit 100B+ parameter models on fewer GPUs. The GB200 is a dual-die superchip with a Grace ARM CPU directly attached — it is a complete compute node, not just a GPU, and it is priced accordingly.
Real Cloud Pricing Right Now (February 2025)
Blackwell GPUs are available today from multiple providers. Pricing has already started to normalize as supply increases, but there is still significant variation across providers.
| GPU | Provider | Spot Price | On-Demand Price |
|---|---|---|---|
| B200 180GB | Verda | $1.67/hr | — |
| B200 180GB | Vast.ai | — | $3.40/hr |
| B200 180GB | Lambda | — | $3.99/hr |
| B300 262GB | Verda | $2.45/hr | — |
| GB200 186GB | CoreWeave | — | $6.50/hr |
Key takeaway: B200 spot pricing at $1.67/hr from Verda is remarkable — that is only 45% more than H100 spot pricing from the cheapest providers, for a GPU that delivers 2.25x the VRAM and roughly 2.5x the inference throughput. On a per-token basis, B200 is already cheaper than H100.
B200 vs H100: The Deep-Dive Comparison
The B200 follows the same architectural lineage as the H100 but gains second-generation FP8 tensor cores, a dramatically wider memory subsystem, and more than double the VRAM. The headline numbers tell the story, but the real-world implications require deeper analysis.
VRAM: 180 GB vs 80 GB. This is the single biggest upgrade. A Llama-3 70B model in FP16 requires approximately 140 GB of VRAM just for the model weights. On an H100, this is physically impossible — you need tensor parallelism across two GPUs, which adds latency and doubles your cost. On a B200, the same model fits in a single GPU with 40 GB to spare for KV cache and activations. This alone eliminates an entire class of infrastructure complexity for 70B-scale inference.
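The weight-memory arithmetic above is worth internalizing: FP16 weights take two bytes per parameter, so capacity planning reduces to a one-line estimate (this ignores KV cache, activations, and framework overhead, which is why real deployments need headroom beyond it):

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed for model weights alone (FP16 = 2 bytes/param)."""
    return params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

llama_70b_fp16 = weight_vram_gb(70)        # 140.0 GB: needs 2x H100, fits 1x B200
llama_70b_int4 = weight_vram_gb(70, 0.5)   # 35.0 GB: fits a single H100
print(llama_70b_fp16, llama_70b_int4)
```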
Memory Bandwidth: 8 TB/s vs 3.35 TB/s. Memory bandwidth is the primary bottleneck for LLM inference (autoregressive decoding is memory-bound, not compute-bound). The B200 delivers 2.39x the bandwidth of the H100, which translates almost directly into faster token generation for single-batch inference. In practice, inter-token latency drops significantly, since every decode step must stream the full set of model weights from VRAM.
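A simple roofline sketch makes the bandwidth argument concrete: at batch size 1, each generated token requires streaming all model weights from VRAM, so peak bandwidth divided by weight bytes gives an upper bound on decode speed. This is an idealized estimate that ignores KV-cache reads and kernel overhead, and it assumes the weights fit in a single GPU's VRAM:

```python
WEIGHT_BYTES = 140e9  # Llama-3 70B in FP16 (assumes the weights fit in VRAM)

def max_decode_tok_s(bandwidth_bytes_s: float) -> float:
    """Bandwidth-roofline upper bound on single-batch decode throughput."""
    return bandwidth_bytes_s / WEIGHT_BYTES

h100_ceiling = max_decode_tok_s(3.35e12)  # ~23.9 tok/s
b200_ceiling = max_decode_tok_s(8.0e12)   # ~57.1 tok/s
print(b200_ceiling / h100_ceiling)        # ~2.39x, matching the bandwidth ratio
```

Real serving stacks land below these ceilings, but the ratio between the two GPUs tends to hold, which is why the bandwidth number matters more than peak TFLOPS for decode-heavy workloads.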
2nd-Gen FP8 Tensor Cores. The B200's FP8 performance reaches 9 PFLOPS compared to the H100's 3.96 PFLOPS. NVIDIA's second-generation FP8 implementation includes improved accuracy characteristics and better hardware support for mixed-precision training. For training workloads that can leverage FP8, this is a 2.27x improvement in peak throughput.
NVLink: 1.8 TB/s vs 900 GB/s. For multi-GPU configurations, the interconnect bandwidth has doubled. In an 8x B200 NVLink configuration, each GPU can communicate at 1.8 TB/s — making tensor parallelism and pipeline parallelism substantially more efficient.
Why Blackwell Matters: The 70B Inflection Point
The 70B parameter class is where Blackwell changes the economics most dramatically. Before the B200, serving a 70B model in FP16 required a minimum of two H100 GPUs — that meant 2x the hourly cost, tensor parallelism overhead, NVLink dependencies, and orchestration complexity. With the B200's 180 GB of VRAM, you run the same model on a single GPU.
Here is the math. Llama-3.1 70B in FP16 requires about 140 GB for weights. On two H100s via Verda spot, you pay approximately $2.30/hr (2 x $1.15). On a single B200 via Verda spot, you pay $1.67/hr. That is a 27% cost reduction, with lower latency (no inter-GPU communication), simpler deployment, and fewer failure points. Multiply this across thousands of serving instances and the savings are substantial.
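The comparison reduces to a few lines, using the spot rates quoted in this guide (actual prices fluctuate, so substitute current numbers before deciding):

```python
h100_spot = 1.15   # $/hr, Verda spot (as quoted above)
b200_spot = 1.67   # $/hr, Verda spot

dual_h100 = 2 * h100_spot           # $2.30/hr for the 2x H100 setup
savings = 1 - b200_spot / dual_h100
print(f"{savings:.0%}")             # ~27% cheaper per hour

# Over a month of continuous serving, per instance:
monthly_delta = (dual_h100 - b200_spot) * 24 * 30   # ~$454 saved
```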
Even for quantized models, the extra VRAM pays dividends. A 70B model in INT4 (GPTQ/AWQ) occupies about 35 GB on an H100 — which leaves only 45 GB for the KV cache. On a B200, you have 145 GB free for KV cache, which means you can serve significantly more concurrent users before hitting memory limits. For high-throughput inference, this translates directly into higher revenue per GPU.
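You can estimate that KV-cache headroom directly from the model architecture. The sketch below assumes Llama-3 70B's published shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache; other models and cache dtypes will change the constants:

```python
layers, kv_heads, head_dim = 80, 8, 128     # Llama-3 70B (GQA)
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V tensors, FP16
# = 327,680 bytes, i.e. ~0.31 MiB of cache per token in context

GiB = 1024**3
for label, free_gib in [("H100, INT4 weights", 45), ("B200, INT4 weights", 145)]:
    tokens = free_gib * GiB // kv_bytes_per_token
    print(f"{label}: ~{tokens:,} cached tokens")   # 147,456 vs 475,136

# A single full 128K-token sequence needs 40 GiB of KV cache on its own,
# which is why long-context serving on the H100 gets tight so quickly.
```

Divide the token budget by your typical context length to get a rough ceiling on concurrent sequences; serving frameworks with paged KV caches (e.g. vLLM) get close to this bound in practice.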
Real-World Performance: Benchmarks and Independent Testing
NVIDIA's official MLPerf submissions show the B200 achieving approximately 2.5x the inference throughput of the H100 on large language models. Independent benchmarks from cloud providers and the community generally confirm a 2x to 2.8x improvement depending on the workload, batch size, and model architecture.
For LLM serving with vLLM, early testers report that a single B200 matches or exceeds a 2x H100 setup on Llama-3.1 70B throughput while delivering 30-40% lower latency per request. The combination of higher bandwidth, larger VRAM (enabling larger batch sizes), and faster FP8 compute creates a compounding advantage that outperforms the simple TFLOPS ratio.
For training workloads, the advantage is more model-dependent. Compute-bound training (small models, large batches) sees roughly 2x improvement. Memory-bound training (large models with gradient checkpointing) can see up to 3x improvement because the B200 can hold more in VRAM, reducing recomputation overhead.
When to Use B200 Over H100
- 70B+ model inference in FP16 or FP8: Single-GPU deployment eliminates tensor parallelism complexity and reduces cost per token.
- High-throughput batch inference: The extra VRAM enables larger batch sizes, which increases GPU utilization and throughput per dollar.
- Training runs where memory is the bottleneck: If you are currently using gradient checkpointing or activation recomputation on H100, the B200 may let you fit everything in VRAM, dramatically accelerating training.
- Multi-model serving: With 180 GB, you can host multiple smaller models (e.g., a 7B + a 13B + an embedding model) on a single GPU, simplifying your infrastructure.
- Long-context inference: Models with 128K+ context windows generate enormous KV caches. The B200's extra VRAM keeps you from running out of memory at high sequence lengths.
When H100 Is Still the Right Choice
- Models under 70B parameters: If your model fits comfortably in 80 GB (anything up to ~30B FP16), the H100 delivers excellent performance at a lower hourly rate.
- LoRA fine-tuning of 7B-13B models: LoRA adapters add minimal VRAM overhead. An H100 handles these workloads with room to spare.
- Budget-conscious inference with quantized models: A 70B model in INT4 fits in an H100's 80 GB. If latency requirements are modest, this remains the cheapest option at ~$1.15/hr spot.
- Spot availability: H100 spot instances are widely available from many providers. B200 spot supply is still growing, and you may face availability constraints during peak demand.
Provider Availability and Recommendations
As of February 2025, B200 availability is expanding rapidly. Verda offers the cheapest spot pricing at $1.67/hr, making it the best choice for interruptible workloads like batch inference and experimentation. Vast.ai provides the most accessible on-demand pricing at $3.40/hr with a marketplace model that gives you flexibility in choosing configurations. Lambda sits at $3.99/hr on-demand with a more curated experience and better enterprise support.
The B300 is still limited in availability. Verda is the only provider offering it at $2.45/hr spot, and it is the right choice if you are working with 100B+ parameter models and want to minimize the number of GPUs required. At 262 GB, a single B300 can hold a 120B FP16 model — something that would require two B200s or four H100s.
The GB200 is a premium offering from CoreWeave at $6.50/hr. It is designed for customers who need the integrated Grace CPU and the dual-die configuration. Unless you have a specific need for the ARM CPU or the 18 PFLOPS of dual-die FP8 throughput, the standalone B200 is a better value.
Price Prediction: Where B200 Is Heading
Every GPU generation follows the same pricing curve: initial scarcity drives prices up, supply increases over 12-18 months, and prices normalize to a point where the performance-per-dollar significantly exceeds the previous generation. The H100 followed this pattern — on-demand prices dropped from $4+/hr at launch to under $2/hr at some providers today.
We predict B200 on-demand pricing will drop below $2.00/hr by Q4 2025 as TSMC ramps 4NP production and more cloud providers bring B200 inventory online. Spot pricing will likely settle around $1.00-1.30/hr, which will make the B200 the undisputed best value for any workload that benefits from more than 80 GB of VRAM.
If you are signing reserved contracts today, negotiate based on these projected rates. A 1-year commitment at $2.50/hr on-demand for a B200 is reasonable today but will look expensive by late 2025. Push for rate adjustments or shorter commitment periods.
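One way to sanity-check a reserved quote is to price it against an assumed decline curve. The sketch below assumes on-demand falls linearly from today's $3.40/hr to the projected $2.00/hr over twelve months; that is an illustrative assumption for negotiation purposes, not a forecast to bank on:

```python
hours_per_year = 24 * 365          # 8,760 hours at full utilization
reserved_rate = 2.50               # $/hr, hypothetical 1-year commitment
avg_on_demand = (3.40 + 2.00) / 2  # mean of a linear decline = $2.70/hr

reserved_total = reserved_rate * hours_per_year    # $21,900
on_demand_total = avg_on_demand * hours_per_year   # ~$23,652
# Reserved still wins over the full year at 100% utilization, but by
# month 12 you are paying $0.50/hr above the projected market rate;
# that gap is exactly why rate-adjustment clauses are worth negotiating.
```

At lower utilization the comparison tilts further toward on-demand, since you pay the committed rate whether the GPU is busy or not.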
The bottom line: the B200 is the new default choice for serious inference and training workloads. The H100 is not obsolete — it remains excellent value for sub-70B models — but for anything larger, Blackwell delivers a step-change in capability and economics. Check GPU Prices for the latest Blackwell pricing across all providers.