Interactive calculator
LLM Inference Cost Calculator
How much does self-hosted LLM inference cost vs API providers? Calculate with live GPU pricing from 54+ providers.
LLM Model
Quantization
Full precision · ~16 GB VRAM
Daily volume
1M/day
Cheapest self-hosted cost: $0.0006 per 1M tokens
99.98% cheaper than GPT-4o ($2.50/1M)
Best GPUs for Llama 3.1 8B
Cost per 1M tokens: Self-hosted vs API
Self-hosted (RTX PRO 6000): $0.0006
GPT-4o: $2.50
Claude 3.5 Sonnet: $3.00
Llama 3 70B (Groq): $0.59
GPT-4o mini: $0.15
Claude 3.5 Haiku: $0.25
Llama 3 8B (Together): $0.10
API prices as of March 2026 · Self-hosted based on cheapest available GPU
Monthly cost at 1M/day
Self-hosted: $0.02/month
GPT-4o: $75/month
Claude 3.5 Sonnet: $90/month
Llama 3 70B (Groq): $18/month
When to Self-Host LLM Inference
Use API providers when:
- You process fewer than 100K tokens per day
- You need frontier models (GPT-4o, Claude 3.5)
- Zero operational overhead is the priority
- Traffic is unpredictable and bursty
Self-host on cloud GPUs when:
- You process 1M+ tokens per day consistently
- Open-source models meet your quality needs
- Data privacy matters — no third-party API calls
- Latency matters — self-hosting can deliver 2–5× lower latency than API round-trips
The break-even is around 500K–1M tokens per day. Above that, self-hosting saves 80–95% vs API pricing. For hardware vs cloud analysis, see our Buy vs Rent Calculator.
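The break-even point can be estimated by comparing a 24/7 GPU rental against pay-per-token API spend. A sketch — the $0.10/hr spot rate is an illustrative assumption, paired with GPT-4o's $2.50/1M price from the table above:

```python
def break_even_tokens_per_day(gpu_price_per_hour: float, api_price_per_1m: float) -> float:
    """Daily token volume at which a 24/7 GPU rental matches API spend."""
    daily_gpu_cost = gpu_price_per_hour * 24
    return daily_gpu_cost / api_price_per_1m * 1_000_000

# Illustrative: a $0.10/hr spot GPU vs GPT-4o at $2.50 per 1M tokens
print(f"{break_even_tokens_per_day(0.10, 2.50):,.0f} tokens/day")
```

At that spot rate the break-even lands near 1M tokens/day, consistent with the 500K–1M range cited above; a pricier on-demand GPU pushes the threshold higher.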
Cheapest Cloud GPU
Live lowest-priced instances
GPU Pricing 2026
Full market overview
Buy vs Rent
Hardware vs cloud break-even
Frequently Asked Questions
Get notified when inference costs drop
Set a GPU price threshold and we'll email you when cheaper options appear.
Set up price alerts · Free, no signup required