
Cost Per Token: Which Cloud GPU Is Actually Cheapest for LLM Inference in 2026

Comparing H100 vs A100 vs RTX 4090 vs L40S for LLM inference by cost per million tokens — not just hourly rate. The RTX 4090 at $0.17/hr beats most datacenter GPUs on cost efficiency for 7B–13B models.

March 19, 2026 · 10 min read

The hourly rate of a GPU tells you almost nothing useful when choosing compute for LLM inference. An H100 at $2.99/hr and an RTX 4090 at $0.34/hr are both "expensive" or "cheap" depending on how fast they generate tokens. What actually matters is cost per million tokens — how much you pay to generate a specific volume of output.

We calculated cost-per-token for the major cloud GPUs using current prices from GPU Tracker's live feed (March 2026) and standard throughput benchmarks on Llama-class models. The results are counterintuitive.

The Math: How Cost Per Token Works

For a given GPU running a specific model at batch size 1:

cost_per_million_tokens = (price_per_hour / tokens_per_second / 3600) × 1,000,000

A GPU that costs twice as much per hour but generates three times the tokens is 1.5x cheaper per token. The ratio of price to throughput is what determines value — not the hourly rate alone.
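
To make the arithmetic concrete, here is a minimal Python sketch of that formula. The prices and throughputs in the example are illustrative placeholders, not figures from the live feed:

```python
# Minimal sketch of the cost-per-token formula above.
# Example prices and throughputs are illustrative, not live GPU Tracker data.

def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars spent to generate one million output tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# A GPU that costs 2x per hour but generates 3x the tokens is 1.5x cheaper per token:
slow = cost_per_million_tokens(price_per_hour=1.00, tokens_per_second=1_000)
fast = cost_per_million_tokens(price_per_hour=2.00, tokens_per_second=3_000)
print(f"slow: ${slow:.3f}/M  fast: ${fast:.3f}/M  ratio: {slow / fast:.1f}x")  # ratio: 1.5x
```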

7B Model Inference: Cost Per Million Tokens

Using throughput benchmarks for Llama 3 8B at float16, batch size 1 (single-request latency — the most common inference scenario). Prices reflect cheapest available spot price from the live feed.

| GPU | Spot Price | ~Tok/sec | Cost/M Tokens | vs GPT-4o |
|---|---|---|---|---|
| A100 80GB | $0.08/hr | 3,500 | $0.006 | 2,500× cheaper |
| RTX 3090 24GB | $0.05/hr | 1,800 | $0.008 | 1,875× cheaper |
| RTX 5090 32GB | $0.13/hr | 3,200 | $0.011 | 1,364× cheaper |
| H200 141GB | $0.33/hr | 7,500 | $0.012 | 1,250× cheaper |
| RTX 4090 24GB | $0.17/hr | 2,500 | $0.019 | 789× cheaper |
| L40S 48GB | $0.26/hr | 2,800 | $0.026 | 577× cheaper |
| H100 80GB | $0.80/hr | 6,000 | $0.037 | 405× cheaper |
| T4 16GB | $0.07/hr | 400 | $0.049 | 306× cheaper |

Throughput estimates for Llama 3 8B fp16, single-request, batch size 1. Actual throughput varies with model size, quantization, batch size, and vLLM/TGI configuration. GPT-4o pricing estimated at $15/M output tokens.

Key insight: The A100 spot at $0.08/hr is the cheapest per token for 7B models — not because it's the fastest GPU, but because it's by far the cheapest in absolute terms while still being fast enough. An RTX 4090 spot at $0.17/hr costs more than twice as much per hour as an A100 spot and generates fewer tokens per second, so it works out to roughly 3x the cost per token. The A100 spot market is the hidden gem of GPU cloud pricing.
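
If you want to re-rank the table yourself, the calculation is a few lines of Python. The sketch below copies the spot prices and throughput estimates from the table above; swap in current numbers from the live feed before trusting the ordering:

```python
# Re-ranking the 7B table: spot prices ($/hr) and throughput estimates (tok/s)
# are copied from the table above; replace them with current figures.

API_PRICE_PER_M = 15.00  # estimated GPT-4o output price used in the table, $/M tokens

def cost_per_million_tokens(price_per_hour, tokens_per_second):
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000

gpus = {
    "A100 80GB":  (0.08, 3500),
    "RTX 3090":   (0.05, 1800),
    "RTX 5090":   (0.13, 3200),
    "H200 141GB": (0.33, 7500),
    "RTX 4090":   (0.17, 2500),
    "L40S 48GB":  (0.26, 2800),
    "H100 80GB":  (0.80, 6000),
    "T4 16GB":    (0.07, 400),
}

for name, (price, tps) in sorted(gpus.items(), key=lambda kv: cost_per_million_tokens(*kv[1])):
    cost = cost_per_million_tokens(price, tps)
    print(f"{name:11s} ${cost:.3f}/M tokens  ({API_PRICE_PER_M / cost:,.0f}x cheaper than the API)")
```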

70B Model Inference: What Changes

For 70B models (Llama 3 70B, Qwen 72B, etc.), you need 80GB+ of VRAM for 4-bit quantized inference, or 140GB+ to run 8-bit (FP8/Q8) with room for KV cache. This eliminates consumer GPUs entirely and changes the cost picture significantly.
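
A weights-only VRAM estimate makes those cutoffs obvious. The sketch below is a back-of-the-envelope calculation (parameter count times bytes per parameter); real deployments need additional headroom for KV cache, activations, and framework overhead, so treat the result as a floor:

```python
# Back-of-the-envelope VRAM needed just for model weights at a given precision.
# KV cache, activations, and framework overhead add more, so treat this as a floor.

BYTES_PER_PARAM = {"q4": 0.5, "q8": 1.0, "fp8": 1.0, "fp16": 2.0}

def weight_vram_gb(params_billion: float, precision: str) -> float:
    # 1e9 params * bytes-per-param / 1e9 bytes-per-GB cancels out
    return params_billion * BYTES_PER_PARAM[precision]

for precision in ("q4", "q8", "fp16"):
    print(f"70B @ {precision:4s}: ~{weight_vram_gb(70, precision):.0f} GB of weights")
# q4 ~35 GB fits an 80GB card with room for KV cache; q8/fp8 ~70 GB is why the
# 141GB H200 is the comfortable choice; fp16 ~140 GB needs multi-GPU or larger.
```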

| GPU | VRAM | Spot Price | ~Tok/sec (70B Q4) | Cost/M Tokens |
|---|---|---|---|---|
| A100 80GB | 80GB | $0.08/hr | 800 | $0.028 |
| H100 80GB | 80GB | $0.80/hr | 1,800 | $0.123 |
| H200 141GB | 141GB | $0.33/hr | 2,200 | $0.042 |
| B200 180GB | 180GB | $1.67/hr | 4,000 | $0.116 |

For 70B inference, the A100 80GB spot ($0.08/hr) is again the cheapest per token by a large margin. It runs quantized 70B inference more slowly than an H100 (roughly 800 vs 1,800 tok/sec), but the H100 costs 10x more per hour, so every dollar spent on it buys about 4x fewer tokens. The A100 wins for cost-sensitive inference at any reasonable volume.
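
A quick tokens-per-dollar check, using the spot prices and throughput estimates from the table above, shows why volume doesn't change the conclusion:

```python
# Tokens per dollar for 70B Q4 inference, using the spot prices and throughput
# estimates from the table above.

def tokens_per_dollar(price_per_hour, tokens_per_second):
    return tokens_per_second * 3600 / price_per_hour

a100 = tokens_per_dollar(0.08, 800)     # ~36M tokens per dollar
h100 = tokens_per_dollar(0.80, 1800)    # ~8.1M tokens per dollar
print(f"A100 buys {a100 / h100:.1f}x more tokens per dollar than the H100")  # ~4.4x
```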

The H200 spot ($0.33/hr) is the interesting outlier: it runs 70B models in Q8 mode (141GB VRAM accommodates it), giving substantially better quality than Q4. The $0.042/M token rate makes it the best choice when output quality matters as much as cost.

When the H100 Actually Makes Sense for Inference

The H100 is not cost-optimal for inference in most scenarios. It only makes financial sense when:

  • Latency is the constraint, not cost: The H100 generates tokens faster, which reduces time-to-first-token and improves user experience. If you're building a product where latency degrades conversion, the H100 may be worth the premium.
  • High-concurrency throughput: At batch sizes of 64 and above, the H100's compute advantage over the A100 grows significantly. For a high-traffic inference endpoint processing thousands of concurrent requests, the H100 at $2.99/hr on-demand may end up cheaper per request served than a pool of five A100s at $0.34/hr each (see the break-even sketch after this list).
  • Specialized FP8 workloads: The H100's native FP8 support gives 1.5–2x throughput improvement for models deployed with FP8 quantization. For high-volume production deployments using FP8, the effective cost-per-token narrows significantly.
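
A simple break-even check shows how large that batched advantage has to be. Using the on-demand prices quoted above, the H100 only matches the A100 on cost per token once its aggregate throughput exceeds the hourly-price ratio; your own batched benchmarks have to supply the actual throughput numbers:

```python
# Break-even sketch for the high-concurrency case: at these on-demand prices,
# the H100 matches the A100 on cost per token only once its batched throughput
# is price-ratio times higher. Plug in your own batched benchmark numbers.

H100_PRICE = 2.99   # $/hr on-demand, from above
A100_PRICE = 0.34   # $/hr on-demand, from above

required_speedup = H100_PRICE / A100_PRICE
print(f"The H100 needs {required_speedup:.1f}x the A100's batched throughput "
      f"to break even on cost per token")  # ~8.8x
```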

Recommendations by Workload

| Use Case | Best GPU | Why |
|---|---|---|
| 7B model, cost-first | A100 80GB spot ($0.08/hr) | Cheapest cost-per-token, fast enough for batch |
| 7B model, latency-first | RTX 4090 on-demand ($0.34/hr) | Fast inference, widely available, no interruption |
| 13B model inference | RTX 5090 spot ($0.13/hr) | 32GB VRAM handles Q8, cheapest per token at this size |
| 70B model, cost-first | A100 80GB spot ($0.08/hr) | Q4 70B at $0.028/M tokens is unbeatable |
| 70B model, quality-first | H200 spot ($0.33/hr) | 141GB VRAM enables Q8, competitive cost/token |
| High-concurrency production | H100 on-demand ($1.58–2.99/hr) | Compute wins at large batch sizes |
| Tight budget, max scale | RTX 3090 spot ($0.05/hr) | Cheapest absolute price, handles 7B well |

See current prices for all GPUs at GPU Tracker. Use the workload recipes tool for a guided recommendation.
