Cost Per Token: Which Cloud GPU Is Actually Cheapest for LLM Inference in 2026

The hourly rate of a GPU tells you almost nothing useful when choosing compute for LLM inference. An H100 at $2.99/hr and an RTX 4090 at $0.34/hr can each be "expensive" or "cheap" depending on how fast they generate tokens. What actually matters is cost per million tokens: how much you pay to generate a given volume of output.

We calculated cost-per-token for the major cloud GPUs using current prices from GPU Tracker's live feed (March 2026) and standard throughput benchmarks on Llama-class models. The results are counterintuitive.

The Math: How Cost Per Token Works

For a given GPU running a specific model at batch size 1:

cost_per_million_tokens = (price_per_hour / tokens_per_second / 3600) × 1,000,000

A GPU that costs twice as much per hour but generates three times the tokens is 1.5x cheaper per token. The ratio of price to throughput is what determines value — not the hourly rate alone.
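The formula above can be written as a small helper. This is a minimal sketch: the function name is illustrative, and the example inputs are the A100 and H100 spot rows from the 7B table in this article.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens, assuming 100% utilization."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Example inputs: A100 and H100 spot rows from the 7B table.
a100 = cost_per_million_tokens(0.08, 3500)
h100 = cost_per_million_tokens(0.80, 6000)
print(f"A100: ${a100:.3f}/M tokens, H100: ${h100:.3f}/M tokens")
```

Because only the ratio of price to throughput enters the result, doubling the price while tripling the throughput scales the cost by 2/3.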

7B Model Inference: Cost Per Million Tokens

The table uses throughput benchmarks for Llama 3 8B at float16 and batch size 1 (single-request latency, the most common inference scenario). Prices are the cheapest available spot listings from the live feed.

GPU	Spot Price	~Tok/sec	Cost/M Tokens	vs GPT-4o
A100 80GB	$0.08/hr	3,500	$0.006	2,500× cheaper
RTX 3090 24GB	$0.05/hr	1,800	$0.008	1,875× cheaper
RTX 5090 32GB	$0.13/hr	3,200	$0.011	1,364× cheaper
H200 141GB	$0.33/hr	7,500	$0.012	1,250× cheaper
RTX 4090 24GB	$0.17/hr	2,500	$0.019	789× cheaper
L40S 48GB	$0.26/hr	2,800	$0.026	577× cheaper
H100 80GB	$0.80/hr	6,000	$0.037	405× cheaper
T4 16GB	$0.07/hr	400	$0.049	306× cheaper

Throughput estimates for Llama 3 8B fp16, single-request, batch size 1. Actual throughput varies with model size, quantization, batch size, and vLLM/TGI configuration. GPT-4o pricing estimated at $15/M output tokens.

Key insight: The A100 spot at $0.08/hr is the cheapest per token for 7B models. It is not the fastest GPU in the table, but its hourly price sits near the bottom of the range while its throughput trails only the H100 and H200. An RTX 4090 spot at $0.17/hr costs about 3x more per token than an A100 spot because it is both pricier per hour and slower. The A100 spot market is the hidden gem of GPU cloud pricing.
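To check the ranking, the table can be recomputed from its price and throughput columns. This is a sketch, not the tool used to build the table: the dictionary simply copies the table's figures, and the GPT-4o ratio divides by the rounded cost, which appears to be how the table's last column was derived.

```python
GPT4O_PRICE_PER_M = 15.00  # the article's estimate for GPT-4o output tokens

# name -> (spot $/hr, approx tokens/sec), copied from the 7B table
GPUS = {
    "A100 80GB": (0.08, 3500),
    "RTX 3090 24GB": (0.05, 1800),
    "RTX 5090 32GB": (0.13, 3200),
    "H200 141GB": (0.33, 7500),
    "RTX 4090 24GB": (0.17, 2500),
    "L40S 48GB": (0.26, 2800),
    "H100 80GB": (0.80, 6000),
    "T4 16GB": (0.07, 400),
}


def cost_per_m(price_per_hour, tokens_per_second):
    """Dollars per million tokens at 100% utilization."""
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000


# Sort cheapest-per-token first.
ranked = sorted(
    ((name, cost_per_m(price, tps)) for name, (price, tps) in GPUS.items()),
    key=lambda item: item[1],
)

for name, cost in ranked:
    ratio = GPT4O_PRICE_PER_M / round(cost, 3)
    print(f"{name:<14} ${cost:.3f}/M  ~{ratio:,.0f}x cheaper than GPT-4o")
```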

70B Model Inference: What Changes

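For 70B models (Llama 3 70B, Qwen 72B, etc.), the weights alone take roughly 40GB at Q4, about 70GB at 8-bit (Q8 or FP8), and about 140GB at FP16. No single consumer GPU has that much VRAM, so practical single-card options start at 80GB datacenter GPUs, and the cost picture changes significantly.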

GPU	VRAM	Spot Price	~Tok/sec (70B Q4)	Cost/M Tokens
A100 80GB	80GB	$0.08/hr	800	$0.028
H100 80GB	80GB	$0.80/hr	1800	$0.123
H200 141GB	141GB	$0.33/hr	2200	$0.042
B200 180GB	180GB	$1.67/hr	4000	$0.116

For 70B inference, the A100 80GB spot ($0.08/hr) is again the cheapest per token by a large margin. It runs quantized 70B inference at less than half the H100's speed, but at a tenth of the hourly price it comes out about 4.4x cheaper per token ($0.028 vs $0.123). The H100 would need roughly 8,000 tok/s, about 4.4x its listed 1,800 tok/s, to break even at these prices. The A100 wins for cost-sensitive inference at any reasonable volume.
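The break-even point between two GPUs follows directly from the cost formula: equal cost per token means equal price-to-throughput ratios. A minimal sketch, with an illustrative function name and the A100 and H100 rows of the 70B table as inputs:

```python
def break_even_throughput(price_a, tps_a, price_b):
    """Tokens/sec that GPU B must reach to match GPU A's cost per token.

    Setting price_a / tps_a == price_b / tps_b and solving for tps_b.
    """
    return tps_a * price_b / price_a


# A100 spot: $0.08/hr at 800 tok/s. H100 spot: $0.80/hr, listed at 1,800 tok/s.
needed = break_even_throughput(price_a=0.08, tps_a=800, price_b=0.80)
listed = 1800
print(f"H100 needs ~{needed:,.0f} tok/s, about {needed / listed:.1f}x its listed rate")
```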

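The H200 spot ($0.33/hr) is the interesting outlier: its 141GB of VRAM can hold a 70B model at Q8, giving substantially better quality than Q4. Note that the $0.042/M figure in the table is listed for Q4; Q8 reads roughly twice the weight bytes per token, so expect lower throughput and a higher per-token cost than the table shows. Even so, it is the best choice here when output quality matters as much as cost.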
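A rough way to see which precisions fit on which card is to estimate the weight footprint from parameter count and bits per weight. This sketch counts weights only: it ignores the KV cache, activations, and the per-group scale factors that real quantization formats store, so actual requirements run somewhat higher (which is why Q4 70B is usually quoted at around 40GB rather than 35GB).

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


for label, bits in [("Q4", 4), ("Q8/FP8", 8), ("FP16", 16)]:
    print(f"70B at {label}: ~{weight_memory_gb(70, bits):.0f} GB of weights")
```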

When the H100 Actually Makes Sense for Inference

The H100 is not cost-optimal for inference in most scenarios. It only makes financial sense when:

Latency is the constraint, not cost: The H100 generates tokens faster, which reduces time-to-first-token and improves user experience. If you're building a product where latency degrades conversion, the H100 may be worth the premium.
High-concurrency throughput: At batch sizes of 64 and above, the H100's compute advantage over the A100 grows significantly. For a high-traffic inference endpoint processing thousands of concurrent requests, one H100 at $2.99/hr on-demand can cost less per token than a pool of 5 A100s at $0.34/hr each ($1.70/hr combined), but only if it sustains more than about 1.76× the pool's combined throughput.
Specialized FP8 workloads: The H100's native FP8 support gives a 1.5–1.9x throughput improvement for models deployed with FP8 quantization. For high-volume production deployments using FP8, the effective cost-per-token gap narrows significantly.
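Whether one large GPU beats a pool of smaller ones comes down to how much throughput it must deliver relative to the pool. The sketch below uses the H100 and A100 prices quoted in the list above; the function name is illustrative.

```python
def required_throughput_multiple(single_price, pool_size, pool_unit_price):
    """Multiple of the pool's combined throughput that the single GPU must
    deliver to match the pool's cost per token."""
    return single_price / (pool_size * pool_unit_price)


# One H100 at $2.99/hr versus five A100s at $0.34/hr each.
multiple = required_throughput_multiple(2.99, 5, 0.34)
print(f"pool: ${5 * 0.34:.2f}/hr; the H100 must deliver >{multiple:.2f}x the pool's tokens")
```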

Recommendations by Workload

Use Case	Best GPU	Why
7B model, cost-first	A100 80GB spot ($0.08/hr)	Cheapest cost-per-token, fast enough for batch
7B model, latency-first	RTX 4090 on-demand ($0.34/hr)	Fast inference, widely available, no interruption
13B model inference	RTX 5090 spot ($0.13/hr)	32GB VRAM handles Q8, cheapest per token at this size
70B model, cost-first	A100 80GB spot ($0.08/hr)	Q4 70B at $0.028/M tokens is unbeatable
70B model, quality-first	H200 spot ($0.33/hr)	141GB VRAM enables Q8, competitive cost/token
High-concurrency production	H100 on-demand ($1.58–2.99/hr)	Compute wins at large batch sizes
Tight budget, max scale	RTX 3090 spot ($0.05/hr)	Cheapest absolute price, handles 7B well

See current prices for all GPUs at GPU Tracker. Use the workload recipes tool for a guided recommendation.

How Quantization Changes Cost Per Token

Quantization (running at lower precision than FP16) reduces VRAM use and increases throughput on memory-bound workloads. The cost-per-token impact across the major precision levels for a 13B model:

Precision	VRAM (13B model)	Throughput vs FP16	Cost/M Tokens (H100)	Quality Loss
FP16 (baseline)	~26 GB	1.0×	$0.052	None
FP8 (H100/H200 native)	~14 GB	1.5–1.9×	$0.030	Minimal (<1%)
INT8 (AWQ, GPTQ)	~14 GB	1.3–1.6×	$0.036	Negligible
INT4 (AWQ Q4)	~8 GB	1.8–2.4×	$0.024	1–3% benchmark drop
NF4 (bitsandbytes)	~7 GB	1.4–1.7×	$0.033	1–4% benchmark drop

Takeaway: INT4 (AWQ) is the cost-per-token sweet spot for production inference on most workloads: roughly half the cost of FP16, with quality loss usually under 3% on standard benchmarks. FP8 on H100/H200 is the better choice when you need quality that is nearly indistinguishable from FP16 (under 1% loss in the table above).
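Since cost per token is inversely proportional to throughput, each speedup range in the table implies a cost range. The sketch below derives those ranges from the FP16 baseline; the table's single cost figures fall inside them, though the exact point used for each row is not stated in the article.

```python
FP16_COST_PER_M = 0.052  # H100, 13B model, from the table above

# precision -> (low, high) throughput multiplier vs FP16, copied from the table
SPEEDUPS = {
    "FP8": (1.5, 1.9),
    "INT8": (1.3, 1.6),
    "INT4": (1.8, 2.4),
    "NF4": (1.4, 1.7),
}


def cost_range(base_cost, speedup_low, speedup_high):
    """Return (best, worst) cost per M tokens for a speedup range."""
    return base_cost / speedup_high, base_cost / speedup_low


for name, (low, high) in SPEEDUPS.items():
    best, worst = cost_range(FP16_COST_PER_M, low, high)
    print(f"{name}: ${best:.3f} to ${worst:.3f} per M tokens")
```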

Batch Size: The Lever That Changes Everything

Single-request inference (batch=1) is what most teams measure, but the real economics emerge at the batch sizes typical of a production endpoint. The table shows cost per million tokens on a single H100 SXM running Llama 3 8B INT4, priced at the $0.80/hr spot rate:

Batch Size	Throughput (tok/s)	Latency (TTFT)	Cost/M tokens	Use case
1 (interactive)	~6,000	~40 ms	$0.037	Single user chat
8 (small endpoint)	~28,000	~80 ms	$0.008	Small SaaS API
32 (medium endpoint)	~75,000	~150 ms	$0.003	Multi-tenant inference
128 (high-throughput)	~180,000	~400 ms	$0.0012	Batch processing, RAG
512 (max throughput)	~280,000	~1.5 s	$0.0008	Offline batch inference

A single H100 running at batch size 128 delivers tokens at $0.0012/M — roughly 30× cheaper per token than at batch size 1, and competitive with the cheapest API endpoints from OpenAI, Anthropic, and Together AI. The catch: you only hit those batch sizes with steady concurrent traffic. Bursty workloads can't fill the batch and the per-token economics get worse.
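The batch-size table can be reproduced from its throughput column. In this sketch the $0.80/hr price is an assumption: it is the H100 spot rate from the earlier tables and the one that matches the listed costs.

```python
H100_SPOT_PER_HOUR = 0.80  # assumption: the rate that reproduces the table's costs

# batch size -> approx tokens/sec, copied from the table
THROUGHPUT = {1: 6_000, 8: 28_000, 32: 75_000, 128: 180_000, 512: 280_000}


def cost_per_m(price_per_hour, tokens_per_second):
    """Dollars per million tokens at 100% utilization."""
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000


costs = {batch: cost_per_m(H100_SPOT_PER_HOUR, tps) for batch, tps in THROUGHPUT.items()}
for batch, cost in costs.items():
    print(f"batch {batch:>3}: ${cost:.4f}/M tokens, {costs[1] / cost:.0f}x cheaper than batch 1")
```

At a fixed hourly price, the cost ratio between two batch sizes is simply the inverse of their throughput ratio.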

Methodology

Pricing data: All GPU hourly rates come from GPU Tracker's live feed scraped every 6 hours across 54+ providers. Spot prices reflect the cheapest currently available listing.

Throughput estimates: Tokens/second values are sourced from published vLLM and TGI benchmarks (vLLM v0.6.x, TGI 2.x) for Llama 3 family models at the specified precision and batch size. Where multiple benchmarks exist, we use the median.

Cost calculation: cost_per_million_tokens = (price_per_hour ÷ tokens_per_second ÷ 3600) × 1,000,000. This assumes 100% GPU utilization. Real-world utilization is typically 30–70%, and effective cost scales with 1/utilization, so expect roughly 1.4× to 3.3× the figures shown.
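The utilization adjustment is a single division. A minimal sketch, with an illustrative function name and the A100 spot figure from the 7B table as the ideal cost:

```python
def effective_cost_per_m(ideal_cost_per_m: float, utilization: float) -> float:
    """Scale a 100%-utilization cost by the fraction of time the GPU is busy."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return ideal_cost_per_m / utilization


ideal = 0.006  # A100 80GB spot, 7B table
for utilization in (1.0, 0.7, 0.3):
    cost = effective_cost_per_m(ideal, utilization)
    print(f"{utilization:.0%} utilized: ${cost:.4f}/M tokens")
```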

What we did not model: network egress costs, storage for model weights, and the engineering time of running inference yourself versus calling an API. For workloads under ~10M tokens/month, an API endpoint is usually cheaper once those overheads are included.

Cost-Per-Token FAQ

Which cloud GPU has the lowest cost per token for LLM inference?

For 7B-class models, A100 80GB spot at $0.08/hr is the cheapest per token in 2026 (~$0.006/M tokens at batch=1). For 70B models, the same A100 80GB spot wins again at ~$0.028/M tokens with Q4 quantization. The H100 only beats the A100 on cost per token when you can sustain batch size 64+, i.e., on a steady production endpoint.

Is it cheaper to run my own LLM than use OpenAI or Anthropic APIs?

It depends on volume. At under ~5M tokens/month, API endpoints are almost always cheaper once you account for engineering time, monitoring, and idle GPU costs. At 50M+ tokens/month of steady traffic, self-hosted on an H100 or A100 spot becomes 10–50× cheaper per token. Between 5M and 50M, the answer depends on how steady your traffic is.
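A rough break-even volume can be estimated by comparing a month of GPU rent with the API bill for the same tokens. The sketch below rests on loud assumptions: a 730-hour month with the GPU rented around the clock, the article's $15/M GPT-4o estimate as the API price, and no engineering or monitoring costs. Cheaper APIs or added overheads push the break-even volume higher.

```python
HOURS_PER_MONTH = 730    # assumption: average month, GPU rented continuously
API_PRICE_PER_M = 15.00  # the article's GPT-4o output-token estimate


def break_even_tokens_millions(gpu_price_per_hour, api_price_per_m=API_PRICE_PER_M):
    """Monthly volume (millions of tokens) at which an always-on GPU's rent
    equals the API bill. Ignores engineering time and other overheads."""
    monthly_rent = gpu_price_per_hour * HOURS_PER_MONTH
    return monthly_rent / api_price_per_m


for name, price in [("A100 spot", 0.08), ("H100 spot", 0.80)]:
    volume = break_even_tokens_millions(price)
    print(f"{name}: break-even near {volume:.1f}M tokens/month")
```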

Why does batch size matter so much for inference cost?

LLM inference is memory-bandwidth bound at small batch sizes — most of the GPU's compute sits idle waiting on weights to load. Larger batches amortize the weight-loading cost across more tokens, so throughput rises near-linearly with batch size up to a saturation point. A single H100 can deliver ~30× more tokens/second at batch 128 than at batch 1.

Does FP8 quantization on H100 actually save money?

Yes. FP8 cuts memory bandwidth use in half on H100/H200 (which have native FP8 hardware), delivering 1.5–1.9× throughput vs FP16 with negligible quality loss. For the 13B model in the quantization table, running FP8 on an H100 drops the effective cost per million tokens from ~$0.052 to ~$0.030, a saving of just over 40%.

What is the best GPU for running Llama 3 8B?

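For pure cost: A100 80GB spot ($0.08/hr, ~$0.006/M tokens), with RTX 3090 spot ($0.05/hr, ~$0.008/M tokens) close behind. If you need uninterrupted capacity, RTX 4090 on-demand ($0.34/hr) is the low-cost option. For latency-sensitive workloads: RTX 4090 has the fastest single-request time-to-first-token among consumer-priced GPUs. For production endpoints with steady traffic: a single H100 SXM at batch size 32+ ($0.003/M tokens or less) beats every batch-1 figure in the 7B table.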

Can I run Llama 70B on consumer GPUs?

Yes, with INT4 quantization (AWQ or GGUF). Llama 70B Q4 needs ~40 GB, so two RTX 4090s (2×24 GB = 48 GB) can run it via tensor parallelism. The downside: consumer GPUs lack the NVLink bandwidth of datacenter cards, so single-GPU A100 80GB or H100 80GB is usually faster and cheaper per token.

How do MoE models like Mixtral and Llama 4 affect cost per token?

Mixture-of-experts models activate only 2–4 of N total experts per token, so per-token compute is closer to the active parameter count (often 12–17B) than the total parameter count (often 100B+). This means MoE inference can be 4–8× cheaper per token than dense models of equivalent quality — provided your GPU has enough VRAM to hold all experts at once.

Does spot interruption affect inference cost in practice?

For inference APIs serving real users: yes, materially. Plan for 1–3% downtime/month on spot H100s and have a fallback (on-demand backup or queue). For batch inference jobs (RAG indexing, dataset augmentation): use spot freely and resume from checkpoints — the 60% cost saving easily justifies the engineering overhead.
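One way to budget for a spot-plus-fallback setup is a blended hourly rate. This is a sketch under simplifying assumptions: spot downtime is covered hour for hour by an on-demand instance, and the $0.80 spot and $2.99 on-demand H100 prices are the ones quoted elsewhere in this article. The function name is illustrative.

```python
def blended_hourly_cost(spot_price, fallback_price, spot_downtime_fraction):
    """Average hourly cost when on-demand capacity covers spot downtime."""
    if not 0 <= spot_downtime_fraction <= 1:
        raise ValueError("spot_downtime_fraction must be in [0, 1]")
    uptime = 1 - spot_downtime_fraction
    return uptime * spot_price + spot_downtime_fraction * fallback_price


for downtime in (0.01, 0.03):
    cost = blended_hourly_cost(0.80, 2.99, downtime)
    print(f"{downtime:.0%} spot downtime: ${cost:.4f}/hr blended")
```

Under these assumptions the fallback adds only a few cents per hour; the larger cost of interruptions is usually the engineering needed to fail over cleanly.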

How accurate are your throughput numbers?

They are within ±15% of what you'll measure yourself under the same conditions (same vLLM/TGI version, same precision, same context length, same hardware). Real numbers can drift further if you change the batch scheduler, use draft-model speculative decoding, or run on a different driver version. Always benchmark your specific workload before locking in a GPU choice.

Where can I see live cost-per-token data across GPUs?

Use our live calculator at the homepage — it pulls current GPU prices from our 6-hour-refresh feed and shows per-token cost for the major Llama, Qwen, and Mistral models at common batch sizes. Updated continuously as new pricing comes in.