Pricing · Inference · Analysis

Self-Hosted vs API: The Exact Breakeven Point for Every Model Size

We calculated when self-hosting becomes cheaper than APIs for 8B, 70B, and GPT-4 class models. For standard models, APIs win. For fine-tuned models, self-hosting wins from day one.

February 19, 2026 · 10 min read

"Should I self-host or use an API?" This is the single most expensive decision in AI infrastructure, and most teams get it wrong because they never do the actual math. I did the math. For every major model size — 8B, 13B, 30B, 70B — I calculated the exact breakeven point where self-hosting becomes cheaper than using a hosted API. The answer is not what most people expect.

The Cost Comparison Framework

For each model size, I compared two numbers: (1) the per-token cost of running it on the cheapest cloud GPU that delivers reasonable performance, and (2) the equivalent hosted API's per-token price. The self-hosted cost includes GPU rental, storage, and a 15% "ops tax" for the time you spend managing infrastructure. The API cost is pure per-token pricing.
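The self-hosted side of that comparison reduces to one formula. A minimal sketch: the $0.39/hr RTX 4090 rate and 15% ops tax come from the comparison table; the ~82 tok/s throughput is an assumption on my part, chosen to be consistent with the table's $1.52/1M figure for an 8B model.

```python
def self_hosted_cost_per_million(gpu_hourly: float, tokens_per_second: float,
                                 ops_tax: float = 0.15) -> float:
    """Self-hosted cost per 1M tokens: hourly GPU rental plus the ops tax,
    divided by the tokens the GPU actually produces in an hour."""
    effective_hourly = gpu_hourly * (1 + ops_tax)
    tokens_per_hour = tokens_per_second * 3600
    return effective_hourly / tokens_per_hour * 1_000_000

# RTX 4090 at $0.39/hr, assumed ~82 tok/s sustained on an 8B model.
print(round(self_hosted_cost_per_million(0.39, 82), 2))  # → 1.52
```

Note that throughput is the whole game here: halve the sustained tokens per second and the per-token cost doubles, while the API price stays fixed.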

Model                  Self-Hosted GPU          Self-Hosted $/1M tok   API Equivalent            API $/1M tok   Breakeven
Llama 3 8B             RTX 4090 @ $0.39/hr      $1.52                  Together.ai / Groq        $0.20          Never*
Llama 3 70B            H100 @ $1.29/hr          $5.80                  Together.ai / Fireworks   $0.90          Never*
GPT-4 class (custom)   8x H100 @ $10.32/hr      $8.20                  OpenAI GPT-4o             $2.50          Never*
Fine-tuned 8B          RTX 4090 @ $0.39/hr      $1.52                  Custom model hosting      $3.00+         Day 1
Fine-tuned 70B         H100 @ $1.29/hr          $5.80                  Custom model hosting      $12.00+        Day 1

The Uncomfortable Truth: APIs Win for Standard Models

If you are running standard open-source models without fine-tuning, self-hosting almost never makes economic sense. API providers like Together.ai, Groq, and Fireworks have massive GPU clusters with batch sizes of 64-256, serving thousands of users simultaneously. Their per-token cost is lower because they amortize the GPU cost across all their customers. You cannot compete with that at startup scale.

The asterisk (*) on "Never" above: self-hosting beats APIs when your GPU utilization exceeds ~70%. At 1M+ tokens/hour sustained, you start to approach the economies of scale that make self-hosting viable. For most startups, that is Series A scale, not seed stage.
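The breakeven volume itself is easy to compute: self-hosting pays off once the API bill for your hourly token volume exceeds the GPU's effective hourly cost. A sketch using the GPU rates and API prices from the table above (the function and argument names are mine):

```python
def breakeven_tokens_per_hour(gpu_hourly: float, api_price_per_million: float,
                              ops_tax: float = 0.15) -> float:
    """Tokens/hour at which the GPU's effective hourly cost equals what
    the API would charge for the same volume; above this, self-hosting wins."""
    return gpu_hourly * (1 + ops_tax) / api_price_per_million * 1_000_000

# RTX 4090 at $0.39/hr vs. a $0.20/1M-token API:
print(round(breakeven_tokens_per_hour(0.39, 0.20) / 1e6, 2))  # → 2.24 (M tok/hr)
# H100 at $1.29/hr vs. a $0.90/1M-token API:
print(round(breakeven_tokens_per_hour(1.29, 0.90) / 1e6, 2))  # → 1.65 (M tok/hr)
```

The catch, and the reason for the asterisk: a single card only reaches millions of tokens per hour with heavy batching across many concurrent users, which is exactly the regime the API providers already operate in.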

When Self-Hosting Wins Immediately

Self-hosting is the obvious choice when:

  • You fine-tuned the model. No API provider serves your custom weights. You have to host it yourself. At $1.52/1M tokens for a fine-tuned 8B vs. $3.00+ on custom model hosting platforms, self-hosting is 2x cheaper from day one.
  • You need data privacy. Medical data, legal documents, financial records — if it cannot leave your infrastructure, self-hosting is the only option. The premium is worth it.
  • You need customized inference. Speculative decoding, custom KV cache management, structured generation with outlines — if you need to modify the inference pipeline, you need your own GPU.
  • You have latency requirements under 50ms TTFT. API round trips add 50-200ms of network latency. Self-hosted inference on a local GPU is 10-30ms TTFT.
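The day-one economics of the fine-tuned case are simple arithmetic. A sketch using the $1.52/1M self-hosted and $3.00/1M custom-hosting prices from the table; the 500M tokens/month volume is an illustrative assumption, not a figure from the analysis:

```python
def monthly_savings(self_hosted_per_m: float, api_per_m: float,
                    monthly_tokens_m: float) -> float:
    """Dollars saved per month by self-hosting instead of paying
    a custom-model hosting platform, at a given monthly volume."""
    return (api_per_m - self_hosted_per_m) * monthly_tokens_m

# Fine-tuned 8B at an assumed 500M tokens/month:
print(round(monthly_savings(1.52, 3.00, 500), 2))  # → 740.0
```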

The Decision Framework

Use an API if: You are running a standard model, under 1M tokens/hour, and do not need custom inference or data privacy.

Self-host if: You fine-tuned the model, need data privacy, need custom inference, or are processing 1M+ tokens/hour sustained.

Start with API, switch later: This is the right answer for 80% of startups. Use an API to validate your product, then self-host when you have enough volume to justify the infrastructure.
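The framework above collapses into a few lines of logic. The function and argument names are mine, and the 1M tokens/hour threshold is the article's rule of thumb, not a hard constant:

```python
def choose_deployment(fine_tuned: bool, needs_privacy: bool,
                      needs_custom_inference: bool,
                      tokens_per_hour: float) -> str:
    """Apply the decision framework: self-host for custom weights,
    privacy, custom inference, or sustained high volume; else use an API."""
    if fine_tuned or needs_privacy or needs_custom_inference:
        return "self-host"        # wins from day one
    if tokens_per_hour >= 1_000_000:
        return "self-host"        # sustained volume justifies the GPU
    return "api"                  # validate on an API, switch later

print(choose_deployment(False, False, False, 200_000))  # → api
print(choose_deployment(True, False, False, 50_000))    # → self-host
```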

Ready to self-host? Find the cheapest GPU for your model size on our GPU price comparison. Filter by VRAM to find GPUs that can handle your model.
