"Should I self-host or use an API?" This is the single most expensive decision in AI infrastructure, and most teams get it wrong because they never do the actual math. I did the math. For every major model size — 8B, 13B, 30B, 70B — I calculated the exact breakeven point where self-hosting becomes cheaper than using a hosted API. The answer is not what most people expect.
The Cost Comparison Framework
For each model size, I compared: (1) the hourly price of the cheapest cloud GPU that can run it with reasonable performance, and (2) the per-token price of the equivalent hosted API. The self-hosted cost includes the GPU rental, storage, and a 15% "ops tax" for the time you spend managing infrastructure; the API cost is pure per-token pricing.
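The self-hosted figures in the table below follow directly from this framework: hourly GPU price, plus the 15% ops tax, divided by sustained throughput. A minimal sketch — the tokens-per-hour values are illustrative assumptions chosen to match the table, not benchmarks:

```python
# Self-hosted $ per 1M tokens: GPU rental plus a 15% ops tax,
# divided by sustained throughput. Throughput values are assumed.
OPS_TAX = 0.15

def self_hosted_cost_per_1m(gpu_hourly_usd: float, tokens_per_hour: float) -> float:
    """Cost to generate 1M tokens on a rented GPU, including the ops tax."""
    return gpu_hourly_usd * (1 + OPS_TAX) / (tokens_per_hour / 1_000_000)

# Illustrative throughputs (assumed, not measured):
print(self_hosted_cost_per_1m(0.39, 295_000))  # ~1.52 (8B on RTX 4090 @ $0.39/hr)
print(self_hosted_cost_per_1m(1.29, 255_800))  # ~5.80 (70B on H100 @ $1.29/hr)
```

Note how sensitive the result is to throughput: halve your tokens/hour and your per-token cost doubles, which is why utilization dominates this whole analysis.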
| Model | Self-Hosted GPU | Self-Hosted $/1M tok | API Equivalent | API $/1M tok | Breakeven |
|---|---|---|---|---|---|
| Llama 3 8B | RTX 4090 @ $0.39/hr | $1.52 | Together.ai / Groq | $0.20 | Never* |
| Llama 3 70B | H100 @ $1.29/hr | $5.80 | Together.ai / Fireworks | $0.90 | Never* |
| GPT-4 class (custom) | 8x H100 @ $10.32/hr | $8.20 | OpenAI GPT-4o | $2.50 | Never* |
| Fine-tuned 8B | RTX 4090 @ $0.39/hr | $1.52 | Custom model hosting | $3.00+ | Day 1 |
| Fine-tuned 70B | H100 @ $1.29/hr | $5.80 | Custom model hosting | $12.00+ | Day 1 |
The Uncomfortable Truth: APIs Win for Standard Models
If you are running standard open-source models without fine-tuning, self-hosting almost never makes economic sense. API providers like Together.ai, Groq, and Fireworks have massive GPU clusters with batch sizes of 64-256, serving thousands of users simultaneously. Their per-token cost is lower because they amortize the GPU cost across all their customers. You cannot compete with that at startup scale.
The asterisk (*) on "Never" above: self-hosting beats APIs when your GPU utilization exceeds ~70%. At 1M+ tokens/hour sustained, you start to approach the economies of scale that make self-hosting viable. For most startups, that is Series A scale, not seed stage.
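You can sanity-check the asterisk by comparing the fixed hourly GPU bill against what the same hour of traffic would cost on the API. A sketch using the prices from the table (it ignores multi-GPU setups and batching gains):

```python
OPS_TAX = 0.15

def breakeven_tokens_per_hour(gpu_hourly_usd: float, api_usd_per_1m: float) -> float:
    """Tokens/hour at which the hourly GPU bill (with ops tax) equals the API bill."""
    return gpu_hourly_usd * (1 + OPS_TAX) / api_usd_per_1m * 1_000_000

# Llama 3 8B: RTX 4090 @ $0.39/hr vs API at $0.20/1M tok
print(breakeven_tokens_per_hour(0.39, 0.20))  # ~2.24M tokens/hour
# Llama 3 70B: H100 @ $1.29/hr vs API at $0.90/1M tok
print(breakeven_tokens_per_hour(1.29, 0.90))  # ~1.65M tokens/hour
```

Both figures land above the 1M tokens/hour mark, and hitting them requires heavy batching across concurrent requests — which is exactly the advantage the API providers already have.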
When Self-Hosting Wins Immediately
Self-hosting is the obvious choice when:
- You fine-tuned the model. No API provider serves your custom weights. You have to host it yourself. At $1.52/1M tokens for a fine-tuned 8B vs. $3.00+ on custom model hosting platforms, self-hosting costs roughly half as much from day one.
- You need data privacy. Medical data, legal documents, financial records — if it cannot leave your infrastructure, self-hosting is the only option. The premium is worth it.
- You need customized inference. Speculative decoding, custom KV cache management, structured generation with outlines — if you need to modify the inference pipeline, you need your own GPU.
- You have latency requirements under 50ms TTFT. API round trips add 50-200ms of network latency. Self-hosted inference on a local GPU is 10-30ms TTFT.
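For the fine-tuned case, the gap compounds with volume. A quick sketch of the monthly savings at a given traffic level, using the table's prices (the 500M tokens/month figure is an arbitrary example, not a benchmark):

```python
def monthly_savings_usd(tokens_per_month: float,
                        self_hosted_per_1m: float = 1.52,  # fine-tuned 8B, from the table
                        hosted_per_1m: float = 3.00) -> float:
    """Savings from self-hosting a fine-tuned model vs custom model hosting."""
    return (hosted_per_1m - self_hosted_per_1m) * tokens_per_month / 1_000_000

print(monthly_savings_usd(500_000_000))  # 500M tokens/month -> ~$740/month saved
```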
The Decision Framework
Use an API if: You are running a standard model, under 1M tokens/hour, and do not need custom inference or data privacy.
Self-host if: You fine-tuned the model, need data privacy, need custom inference, or are processing 1M+ tokens/hour sustained.
Start with API, switch later: This is the right answer for 80% of startups. Use an API to validate your product, then self-host when you have enough volume to justify the infrastructure.
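The three rules above collapse into a few lines of code. A sketch — the thresholds are the ones from this post, not universal constants:

```python
def hosting_decision(fine_tuned: bool, needs_privacy: bool,
                     needs_custom_inference: bool, tokens_per_hour: float) -> str:
    """Decision framework from this post: self-host only when an immediate
    win applies or sustained volume clears ~1M tokens/hour."""
    if fine_tuned or needs_privacy or needs_custom_inference:
        return "self-host"
    if tokens_per_hour >= 1_000_000:
        return "self-host"
    return "api"  # start here; switch later once volume justifies the infra

print(hosting_decision(False, False, False, 50_000))  # api
print(hosting_decision(True, False, False, 50_000))   # self-host
```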
Ready to self-host? Find the cheapest GPU for your model size on our GPU price comparison. Filter by VRAM to find GPUs that can handle your model.