"Should I self-host or use an API?" This is the single most expensive decision in AI infrastructure, and most teams get it wrong because they never do the actual math. I did the math. For every major model size — 8B, 13B, 30B, 70B — I calculated the exact breakeven point where self-hosting becomes cheaper than using a hosted API. The answer is not what most people expect.
The Cost Comparison Framework
For each model size, I compared: (1) the hourly price of the cheapest cloud GPU that can run it with reasonable performance, and (2) the per-token price of the equivalent hosted API. The self-hosted cost includes the GPU rental, storage, and a 15% "ops tax" for the time you spend managing infrastructure; the API cost is pure per-token pricing.
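The self-hosted figures in the table below follow directly from this framework: hourly GPU price, plus the 15% ops tax, divided by sustained throughput. A minimal sketch — the tokens-per-hour values are illustrative assumptions chosen to match the table, not benchmarks:

```python
# Self-hosted $ per 1M tokens: GPU rental plus a 15% ops tax,
# divided by sustained throughput. Throughput values are assumed.
OPS_TAX = 0.15

def self_hosted_cost_per_1m(gpu_hourly_usd: float, tokens_per_hour: float) -> float:
    """Cost to generate 1M tokens on a rented GPU, including the ops tax."""
    return gpu_hourly_usd * (1 + OPS_TAX) / (tokens_per_hour / 1_000_000)

# Illustrative throughputs (assumed, not measured):
print(self_hosted_cost_per_1m(0.39, 295_000))  # ~1.52 (8B on RTX 4090 @ $0.39/hr)
print(self_hosted_cost_per_1m(1.29, 255_800))  # ~5.80 (70B on H100 @ $1.29/hr)
```

Note how sensitive the result is to throughput: halve your tokens/hour and your per-token cost doubles, which is why utilization dominates this whole analysis.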
| Model | Self-Hosted GPU | Self-Hosted $/1M tok | API Equivalent | API $/1M tok | Breakeven |
|---|---|---|---|---|---|
| Llama 3 8B | RTX 4090 @ $0.39/hr | $1.52 | Together.ai / Groq | $0.20 | Never* |
| Llama 3 70B | H100 @ $1.29/hr | $5.80 | Together.ai / Fireworks | $0.90 | Never* |
| GPT-4 class (custom) | 8x H100 @ $10.32/hr | $8.20 | OpenAI GPT-4o | $2.50 | Never* |
| Fine-tuned 8B | RTX 4090 @ $0.39/hr | $1.52 | Custom model hosting | $3.00+ | Day 1 |
| Fine-tuned 70B | H100 @ $1.29/hr | $5.80 | Custom model hosting | $12.00+ | Day 1 |
The Uncomfortable Truth: APIs Win for Standard Models
If you are running standard open-source models without fine-tuning, self-hosting almost never makes economic sense. API providers like Together.ai, Groq, and Fireworks have massive GPU clusters with batch sizes of 64-256, serving thousands of users simultaneously. Their per-token cost is lower because they amortize the GPU cost across all their customers. You cannot compete with that at startup scale.
The asterisk (*) on "Never" above: self-hosting beats APIs when your GPU utilization exceeds ~70%. At 1M+ tokens/hour sustained, you start to approach the economies of scale that make self-hosting viable. For most startups, that is Series A scale, not seed stage.
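You can sanity-check the asterisk by comparing the fixed hourly GPU bill against what the same hour of traffic would cost on the API. A sketch using the prices from the table (it ignores multi-GPU setups and batching gains):

```python
OPS_TAX = 0.15

def breakeven_tokens_per_hour(gpu_hourly_usd: float, api_usd_per_1m: float) -> float:
    """Tokens/hour at which the hourly GPU bill (with ops tax) equals the API bill."""
    return gpu_hourly_usd * (1 + OPS_TAX) / api_usd_per_1m * 1_000_000

# Llama 3 8B: RTX 4090 @ $0.39/hr vs API at $0.20/1M tok
print(breakeven_tokens_per_hour(0.39, 0.20))  # ~2.24M tokens/hour
# Llama 3 70B: H100 @ $1.29/hr vs API at $0.90/1M tok
print(breakeven_tokens_per_hour(1.29, 0.90))  # ~1.65M tokens/hour
```

Both figures land above the 1M tokens/hour mark, and hitting them requires heavy batching across concurrent requests — which is exactly the advantage the API providers already have.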
When Self-Hosting Wins Immediately
Self-hosting is the obvious choice when:
- You fine-tuned the model. No API provider serves your custom weights. You have to host it yourself. At $1.52/1M tokens for a fine-tuned 8B vs. $3.00+ on custom model hosting platforms, self-hosting costs roughly half as much from day one.
- You need data privacy. Medical data, legal documents, financial records — if it cannot leave your infrastructure, self-hosting is the only option. The premium is worth it.
- You need customized inference. Speculative decoding, custom KV cache management, structured generation with outlines — if you need to modify the inference pipeline, you need your own GPU.
- You have latency requirements under 50ms TTFT. API round trips add 50-200ms of network latency. Self-hosted inference on a local GPU is 10-30ms TTFT.
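For the fine-tuned case, the gap compounds with volume. A quick sketch of the monthly savings at a given traffic level, using the table's prices (the 500M tokens/month figure is an arbitrary example, not a benchmark):

```python
def monthly_savings_usd(tokens_per_month: float,
                        self_hosted_per_1m: float = 1.52,  # fine-tuned 8B, from the table
                        hosted_per_1m: float = 3.00) -> float:
    """Savings from self-hosting a fine-tuned model vs custom model hosting."""
    return (hosted_per_1m - self_hosted_per_1m) * tokens_per_month / 1_000_000

print(monthly_savings_usd(500_000_000))  # 500M tokens/month -> ~$740/month saved
```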
The Decision Framework
Use an API if: You are running a standard model, under 1M tokens/hour, and do not need custom inference or data privacy.
Self-host if: You fine-tuned the model, need data privacy, need custom inference, or are processing 1M+ tokens/hour sustained.
Start with API, switch later: This is the right answer for 80% of startups. Use an API to validate your product, then self-host when you have enough volume to justify the infrastructure.
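The three rules above collapse into a few lines of code. A sketch — the thresholds are the ones from this post, not universal constants:

```python
def hosting_decision(fine_tuned: bool, needs_privacy: bool,
                     needs_custom_inference: bool, tokens_per_hour: float) -> str:
    """Decision framework from this post: self-host only when an immediate
    win applies or sustained volume clears ~1M tokens/hour."""
    if fine_tuned or needs_privacy or needs_custom_inference:
        return "self-host"
    if tokens_per_hour >= 1_000_000:
        return "self-host"
    return "api"  # start here; switch later once volume justifies the infra

print(hosting_decision(False, False, False, 50_000))  # api
print(hosting_decision(True, False, False, 50_000))   # self-host
```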
Ready to self-host? Find the cheapest GPU for your model size on our GPU price comparison. Filter by VRAM to find GPUs that can handle your model.