
The Real Cost Per Token of Self-Hosted LLM Inference

Self-hosted Llama 3 70B on an A100 costs ~$0.17/1M tokens. GPT-4 costs $10-30/1M tokens. We show the full math, including the hidden costs.

January 24, 2025 · 11 min read

Here is a number that should make every engineering manager reconsider their API strategy: self-hosting Llama 3 70B on an A100 80GB produces inference at approximately $0.63 per million tokens at on-demand rates ($0.34/hr), and roughly $0.17 per million on spot instances. GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens, and GPT-4 Turbo output runs $30.00 per million. That is a 15x to 176x price difference, depending on your configuration and input/output ratio. And before you say "but GPT-4 is a better model" — for many production use cases, a fine-tuned 70B open-source model matches or exceeds GPT-4 quality on domain-specific tasks. You are paying a premium of up to two orders of magnitude for generality you may not need.

This article breaks down the real cost-per-token math for self-hosted inference, including the throughput calculations most blog posts conveniently skip. We will cover multiple GPU configurations, model sizes, and deployment scenarios so you can calculate your own numbers and make an informed build-vs-buy decision.

The Math: How to Calculate Cost Per Token

Cost per token for self-hosted inference is straightforward: divide your hourly GPU cost by the number of tokens you generate per hour. The tricky part is accurately estimating throughput, which depends on model size, GPU specs, quantization, batch size, and your inference framework. Let me walk through each variable.

Throughput Estimation

For autoregressive token generation (the way all LLMs work), the bottleneck is memory bandwidth. Each token requires reading the entire model's weights from VRAM once. The theoretical maximum token rate is:

Max tokens/sec = Memory Bandwidth / Model Size in Bytes

A 70B FP16 model (140 GB) does not fit on a single A100 80GB, so you would need tensor parallelism across two cards, or quantization. With GPTQ 4-bit quantization, the 70B model shrinks to roughly 35 GB. On a single A100 80GB with 2 TB/s of memory bandwidth: 2,000 GB/s / 35 GB = ~57 tokens/sec theoretical maximum. In practice, overhead from attention computation, KV cache reads, and framework inefficiency reduces this to about 35-45 tokens/sec per user at batch size 1.
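The bandwidth ceiling above can be sketched in a few lines of Python; the bandwidth and model-size figures are the article's example numbers, not measured values:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Memory-bandwidth-bound ceiling: generating each token streams
    the full weight set from VRAM once."""
    return bandwidth_gb_s / model_size_gb

# A100 80GB (~2,000 GB/s) serving Llama 3 70B quantized to 4-bit (~35 GB)
ceiling = max_tokens_per_sec(2000, 35)
print(round(ceiling))  # ~57 tokens/sec theoretical maximum
```

Real-world throughput lands below this ceiling, so treat it as an upper bound when sizing hardware, not a performance promise.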

With continuous batching frameworks like vLLM or TGI, you can serve multiple concurrent requests. At batch size 8-16, aggregate throughput jumps to 120-200 tokens/sec because the GPU transitions from bandwidth-bound to compute-bound, and the A100's 312 TFLOPS of tensor compute kicks in.
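Putting the pieces together, cost per million tokens is just the hourly GPU cost spread over the tokens generated in that hour. A minimal sketch, using the article's example prices and batched throughput (not live quotes):

```python
def cost_per_million_tokens(hourly_cost: float, tokens_per_sec: float) -> float:
    """Hourly GPU cost divided across the tokens generated in that hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

# Llama 3 70B (4-bit), ~150 tok/s aggregate with continuous batching:
print(round(cost_per_million_tokens(0.34, 150), 2))  # 0.63  (A100 on-demand)
print(round(cost_per_million_tokens(0.09, 150), 2))  # 0.17  (A100 spot)
```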

Cost Per Million Tokens: Self-Hosted vs. API

| Configuration | GPU Cost | Throughput | $/1M Tokens |
|---|---|---|---|
| Llama 3 70B (4-bit) on A100 80GB | $0.34/hr (Vultr) | ~150 tok/s (batched) | $0.63 |
| Llama 3 70B (4-bit) on A100 spot | $0.09/hr (Vast.ai) | ~150 tok/s (batched) | $0.17 |
| Llama 3 70B (FP16) on H200 | $1.84/hr (Vast.ai) | ~200 tok/s (batched) | $2.56 |
| Llama 3 8B (FP16) on RTX 4090 | $0.39/hr (CloudRift) | ~280 tok/s (batched) | $0.39 |
| Llama 3 8B (FP16) on RTX 4090 spot | $0.17/hr (Vast.ai) | ~280 tok/s (batched) | $0.17 |
| GPT-4o (OpenAI API) | N/A | N/A | $2.50-$10.00 |
| GPT-4 Turbo (OpenAI API) | N/A | N/A | $10.00-$30.00 |
| Claude 3.5 Sonnet (Anthropic API) | N/A | N/A | $3.00-$15.00 |

Read those numbers carefully. A quantized Llama 3 70B on a spot A100 produces tokens at $0.17 per million tokens. GPT-4 Turbo output tokens cost $30 per million. That is a 176x price difference. Even the most expensive self-hosted configuration, FP16 on an H200 at $2.56 per million, is still 4-12x cheaper than commercial API output pricing.

The Utilization Trap: Why the Math Changes at Low Volume

The calculation above assumes your GPU is busy generating tokens continuously. In reality, most inference deployments have variable load — peaks during business hours, troughs at night and weekends. If your GPU sits idle 80% of the time, your effective cost per token jumps 5x. Here is how utilization affects the economics:

| GPU Utilization | Effective $/1M Tokens (A100 at $0.34/hr) | vs GPT-4o Output ($10) |
|---|---|---|
| 100% (continuous batching) | $0.63 | 16x cheaper |
| 50% | $1.26 | 8x cheaper |
| 20% | $3.15 | 3x cheaper |
| 5% | $12.60 | API is cheaper |
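The utilization adjustment is simply the full-utilization cost divided by the utilization fraction, since idle hours still bill. A quick sketch using the article's $0.63 baseline:

```python
def effective_cost_per_million(full_util_cost: float, utilization: float) -> float:
    """Idle hours still bill; spread the hourly cost over fewer tokens."""
    return full_util_cost / utilization

for util in (1.00, 0.50, 0.20, 0.05):
    print(f"{util:.0%}  ${effective_cost_per_million(0.63, util):.2f}/1M tokens")
```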

The breakeven point for self-hosting versus GPT-4o is approximately 8-10% GPU utilization. If you are generating fewer than roughly 50,000 tokens per hour on your self-hosted GPU, you are probably better off using an API. But if you are generating more than 100,000 tokens per hour consistently, self-hosting is dramatically cheaper.

The Hidden Costs of Self-Hosting

The GPU cost is not the only cost. Self-hosting adds engineering overhead that the per-token math does not capture:

  • Infrastructure engineering: Setting up vLLM, TGI, or another serving framework. Configuring load balancing, health checks, auto-scaling. Ongoing maintenance and updates. Budget 40-80 hours of engineering time for initial setup and 5-10 hours per month for maintenance.
  • Model management: Downloading weights, managing model versions, handling quantization. Minor but not zero — especially if you are iterating on fine-tuned models.
  • Monitoring and reliability: Building alerting for GPU health, inference latency, error rates. If your inference endpoint goes down at 3 AM, who gets paged?
  • Quality assurance: An API provider handles model quality. When you self-host, you own the responsibility of ensuring the model performs adequately, including handling edge cases, guardrails, and content filtering.

For a startup with a small team, these costs can be significant. But for a company processing millions of tokens per day, the engineering cost is amortized across massive savings. A team generating 100 million tokens per day at GPT-4o output pricing ($10.00 per million) would spend $1,000/day, or roughly $30,000/month. Self-hosting the same throughput on A100 spot instances at $0.17 per million tokens would cost approximately $17/day, or about $510/month. The roughly $29,500/month difference pays for a lot of engineering overhead.

The controversial claim: Any company spending more than $5,000/month on LLM API calls should be evaluating self-hosting. At that spend level, a single A100 80GB at $0.34/hr ($245/month) can serve the same throughput that costs $5,000+ via API. The math is not close. The only question is whether your team has the engineering bandwidth to set up and maintain the infrastructure. Use our comparison tool to find the cheapest GPUs for your inference workload.
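As a rough sanity check on that claim — assuming full utilization, the ~150 tok/s batched throughput from the table, and GPT-4o output pricing — one A100 at $0.34/hr works out as follows:

```python
hourly_gpu = 0.34
tok_per_sec = 150                                  # batched aggregate throughput

monthly_gpu = hourly_gpu * 24 * 30                 # ~ $245/month
tokens_per_month = tok_per_sec * 3600 * 24 * 30    # ~ 389M tokens

# Same volume priced as GPT-4o output tokens ($10.00/1M). Real API bills
# also include input (prompt) tokens, which push them higher still.
api_equivalent = tokens_per_month / 1e6 * 10.00
print(round(monthly_gpu), round(api_equivalent))   # 245 3888
```

Output-token pricing alone puts the equivalent API bill near $3,900/month; prompt tokens and higher-priced models are what push real-world bills past the $5,000 mark the article uses.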

Optimizing Self-Hosted Inference Cost

Once you have decided to self-host, here are the levers to minimize cost per token:

  • Use quantization aggressively. GPTQ or AWQ 4-bit reduces VRAM by 4x with 1-3% quality loss. This lets you fit larger models on cheaper GPUs.
  • Use continuous batching. Frameworks like vLLM batch multiple requests dynamically, increasing aggregate throughput 3-5x over naive sequential serving.
  • Right-size your GPU. Do not use an H100 at $1.87/hr if an A100 at $0.34/hr handles your throughput requirements. Check our trends page for the latest prices.
  • Use spot instances for non-critical inference. Spot A100s at $0.09/hr on Vast.ai are perfect for batch inference, internal tools, and development environments where brief interruptions are acceptable.
  • Scale to zero when idle. If your traffic has predictable patterns, shut down GPU instances during off-peak hours. You are paying per hour, not per token — idle GPUs are wasted money.
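A hypothetical right-sizing helper ties these levers together: given candidate configurations (the example prices and throughputs from the table above), pick the one with the lowest cost per token:

```python
# (provider prices and batched throughputs are the article's example figures)
configs = [
    ("A100 80GB on-demand", 0.34, 150),   # (name, $/hr, tok/s)
    ("A100 80GB spot",      0.09, 150),
    ("H200 on-demand",      1.84, 200),
    ("RTX 4090 spot",       0.17, 280),
]

def dollars_per_million(hourly: float, tok_per_sec: float) -> float:
    return hourly / (tok_per_sec * 3600) * 1e6

name, hourly, tok_s = min(configs, key=lambda c: dollars_per_million(c[1], c[2]))
print(name, f"${dollars_per_million(hourly, tok_s):.2f}/1M")
```

In practice you would also filter by VRAM (the model has to fit) and by latency targets before sorting on price.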

The Verdict

Self-hosted inference is 10-200x cheaper than commercial APIs at production scale. The exact multiplier depends on your GPU choice, utilization rate, and the API you are comparing against. The crossover point where self-hosting becomes cheaper is roughly 50,000-100,000 tokens per hour sustained. Below that threshold, use an API for simplicity. Above it, self-hosting delivers savings that can transform your cost structure.

Start with a quantized 70B model on a spot A100 at $0.09/hr. Benchmark your throughput with vLLM. Calculate your cost per token. Then compare that number to your current API bill. I am confident the result will surprise you. Use our GPU price comparison tool to find the cheapest GPU for your model size.
