Deploying LLMs to Production: A GPU Cost Optimization Guide

Serving a 7B model to 1,000 users costs $200-$2,000/mo depending on your setup. We break down the math for every architecture choice.

January 28, 2025 · 13 min read

You've fine-tuned a model, it works great in your notebook, and now you need to serve it to real users. This is where most teams discover that the gap between "it works on my machine" and "it works in production at scale" is measured in both engineering effort and monthly cloud bills. The cost of serving an LLM to production traffic can vary by 10x or more depending on your GPU choice, serving framework, batching strategy, and quantization decisions. I've seen teams spend $2,000/month serving a 7B model that could run for $200/month with the right architecture. This guide covers every decision point with real numbers so you can build a production stack that doesn't bankrupt you.

The Key Question: How Many Concurrent Users Can 1 GPU Handle?

Before you can estimate costs, you need to know how much throughput a single GPU delivers. This depends on three things: the GPU's memory bandwidth (which determines token generation speed), the GPU's VRAM (which determines how many concurrent requests fit in memory), and your serving framework (which determines how efficiently you utilize both). Here are realistic throughput numbers for common GPU/model combinations using vLLM with continuous batching, which is the current state-of-the-art for LLM serving.

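| Model | GPU | Concurrent Users | Tokens/sec (total) | $/hr |
| --- | --- | --- | --- | --- |
| 7B FP16 | RTX 4090 (24GB) | 10-15 | 400-600 | $0.39 |
| 7B FP16 | L40S (48GB) | 20-30 | 600-900 | $0.88 |
| 7B FP16 | H100 (80GB) | 40-60 | 1,200-2,000 | $1.87 |
| 13B FP16 | L40S (48GB) | 8-12 | 300-500 | $0.88 |
| 13B FP16 | H100 (80GB) | 25-35 | 800-1,200 | $1.87 |
| 70B FP16 | H100 (80GB) | 5-8 | 150-250 | $1.87 |
| 70B FP16 | H200 (141GB) | 10-15 | 300-500 | $1.84 |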

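The "concurrent users" column is the number of simultaneous streaming requests the GPU can handle while maintaining acceptable per-user token generation speed (roughly 20-40 tokens/sec per user for a good chat experience). These numbers assume vLLM with continuous batching enabled, which is critical: without it, throughput drops by 5-10x. One caveat on the 70B rows: at FP16, a 70B model needs roughly 140GB for its weights alone, which is more than a single 80GB H100 holds and leaves almost no room for KV cache on a 141GB H200. Treat those rows as achievable only with quantized weights or a multi-GPU setup, not with a single card at full FP16.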

Cost Per 1,000 Daily Active Users: Worked Examples

Let's work through the math for a concrete scenario. You have a product with 1,000 daily active users (DAU). Each user makes an average of 3 requests per day, with each request generating approximately 200 output tokens. That's 3,000 requests/day = 600,000 tokens/day.

The peak load is what matters for GPU sizing. If 70% of your traffic happens in an 8-hour window (typical for B2B products), your peak is roughly 2,100 requests in 8 hours = 262 requests/hour = ~4.4 requests/minute. With an average generation time of 5 seconds per request (200 tokens at 40 tokens/sec), peak concurrency is about 0.37 simultaneous requests. That's right — for 1,000 DAU with this usage pattern, you rarely have more than 1 simultaneous request. A single RTX 4090 can handle this easily.
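The arithmetic above is an application of Little's law (in-flight requests = arrival rate × time per request). Here is a minimal sketch of it; the function name and parameter names are mine, and the inputs are the figures from this scenario.

```python
def peak_concurrency(dau, requests_per_user, peak_share, peak_hours,
                     tokens_per_request, tokens_per_sec):
    """Average number of in-flight requests during the peak window."""
    peak_requests = dau * requests_per_user * peak_share
    arrivals_per_sec = peak_requests / (peak_hours * 3600)
    seconds_per_request = tokens_per_request / tokens_per_sec
    # Little's law: concurrency = arrival rate x time spent in the system
    return arrivals_per_sec * seconds_per_request


# 1,000 DAU, 3 requests each, 70% of traffic in 8 hours,
# 200 output tokens generated at 40 tokens/sec
c = peak_concurrency(1000, 3, 0.70, 8, 200, 40)
print(f"{c:.2f}")  # about 0.36, matching the ~0.37 above up to rounding
```

This is an average over the peak window, so short bursts will exceed it. Size for a multiple of this number if your traffic is spiky.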

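| Setup | GPU $/hr | GPUs Needed | Monthly Cost | Cost/User/Mo |
| --- | --- | --- | --- | --- |
| 7B on RTX 4090 | $0.39/hr | 1 | $281/mo | $0.28 |
| 7B on L40S | $0.88/hr | 1 | $634/mo | $0.63 |
| 7B on H100 | $1.87/hr | 1 | $1,346/mo | $1.35 |
| 7B on AWS H100 | $8.46/hr | 1 | $6,091/mo | $6.09 |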

The right choice for 1,000 DAU serving a 7B model: the RTX 4090 at $281/mo. It's 4.8x cheaper than the H100 and 21.7x cheaper than the AWS H100, while comfortably handling the peak load. The H100 only makes sense if you expect to scale to 5,000+ DAU quickly or you need the extra concurrency headroom. Many teams default to the biggest GPU and massively overpay for their actual traffic levels.
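The monthly figures follow from the hourly rate times a 720-hour month (24 × 30), which is the convention the numbers above imply. A short sketch, with helper names of my own choosing, reproduces them and the cost ratios:

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, the convention implied by the figures above


def monthly_cost(hourly_rate, gpu_count=1):
    """Cost of running gpu_count GPUs around the clock for one month."""
    return hourly_rate * HOURS_PER_MONTH * gpu_count


def cost_per_user(hourly_rate, dau, gpu_count=1):
    """Monthly GPU cost divided across daily active users."""
    return monthly_cost(hourly_rate, gpu_count) / dau


rates = {"RTX 4090": 0.39, "L40S": 0.88, "H100": 1.87, "AWS H100": 8.46}
for name, rate in rates.items():
    print(f"{name}: ${monthly_cost(rate):,.0f}/mo, "
          f"${cost_per_user(rate, 1000):.2f}/user/mo")

# How much more the H100 options cost than the RTX 4090
print(round(monthly_cost(1.87) / monthly_cost(0.39), 1))  # 4.8
print(round(monthly_cost(8.46) / monthly_cost(0.39), 1))  # 21.7
```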

Scaling to 10K, 100K, and Beyond

At higher traffic levels, your architecture choices matter more. Here's how costs scale for a 7B model serving different traffic levels, assuming the same usage pattern (3 requests/user/day, 200 tokens/request, 70% of traffic in 8 hours).

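| DAU | Peak Concurrency | RTX 4090 (count / cost) | H100 (count / cost) | RunPod Serverless |
| --- | --- | --- | --- | --- |
| 100 | ~0.04 | 1 / $281 | 1 / $1,346 | ~$15 |
| 1,000 | ~0.4 | 1 / $281 | 1 / $1,346 | ~$150 |
| 10,000 | ~4 | 1 / $281 | 1 / $1,346 | ~$1,500 |
| 100,000 | ~40 | 4 / $1,124 | 1 / $1,346 | ~$15,000 |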

Several important patterns emerge. First, a single RTX 4090 can handle up to about 10,000 DAU for a 7B model with this usage pattern. Most startups never need more than one GPU. Second, RunPod Serverless (pay-per-second, only when processing) is the cheapest option for low traffic (under ~1,000 DAU) because you're not paying for idle time. But at higher traffic levels, self-managed becomes dramatically cheaper because utilization increases and you're no longer paying the serverless markup. Third, 100K DAU only needs 4 RTX 4090s at $1,124/month — still less than a single H100. The RTX 4090 is the more cost-efficient choice until your concurrency exceeds 10-15 per GPU.
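The GPU counts in the scaling table can be reproduced by dividing peak concurrency by what one card sustains and rounding up. The sketch below assumes 10 concurrent requests per RTX 4090, the conservative end of the 10-15 range quoted earlier; the function name is hypothetical.

```python
import math


def gpus_needed(peak_concurrency, per_gpu_capacity):
    """Smallest whole number of GPUs that covers peak concurrency (at least 1)."""
    return max(1, math.ceil(peak_concurrency / per_gpu_capacity))


RTX_4090_MONTHLY = 281  # $/month for one card, from the cost table above

# (DAU, approximate peak concurrency) pairs taken from the scaling table
for dau, concurrency in [(100, 0.04), (1_000, 0.4), (10_000, 4), (100_000, 40)]:
    n = gpus_needed(concurrency, per_gpu_capacity=10)
    print(f"{dau:>7,} DAU: {n} x RTX 4090 = ${n * RTX_4090_MONTHLY:,}/mo")
```

Using the optimistic end of the range (15 per card) would give 3 cards at 100,000 DAU, so the table's count of 4 leaves some headroom.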

Scaling Patterns: Horizontal vs Vertical vs Autoscaling

Horizontal Scaling: Multiple Identical GPUs

The simplest scaling pattern. Run multiple identical GPU instances, each serving the same model, behind a load balancer. Every GPU is independent — no communication between them. If one dies, the others keep serving. To scale up, add more GPUs. To scale down, remove them. This is the right default for most teams. It's operationally simple, each GPU is a commodity that can be replaced, and you can mix providers (e.g., 2 GPUs on RunPod and 2 on Lambda for redundancy). The only complexity is the load balancer, which can be as simple as an NGINX reverse proxy with round-robin routing, or a managed load balancer from your cloud provider.
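To make the routing logic concrete, here is a minimal round-robin sketch that skips backends marked unhealthy. It only illustrates the idea: in production you would use NGINX or a managed load balancer as described above, and the class, method names, and backend URLs here are all hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Cycle through identical backends, skipping any marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


# Hypothetical endpoints, one per GPU instance
lb = RoundRobinBalancer(["http://gpu-a:8000", "http://gpu-b:8000", "http://gpu-c:8000"])
print([lb.next_backend() for _ in range(4)])
lb.mark_down("http://gpu-b:8000")  # simulated GPU failure
print([lb.next_backend() for _ in range(2)])
```

Because every GPU holds the same model and no request state is shared, losing one backend only reduces capacity. The remaining instances keep serving.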

Vertical Scaling: Bigger GPUs

Instead of 4 RTX 4090s, use 1 H100 that handles 4x the concurrency. Less infrastructure to manage — one machine, one model instance, one failure point. This is attractive for small teams that don't want to manage a multi-instance deployment. The tradeoff: no redundancy (if the GPU dies, you're down), less cost-efficient (H100 is more expensive per concurrent user than RTX 4090 for most workloads), and you're limited by the maximum capability of a single GPU. Vertical scaling is a dead end — you'll eventually need to go horizontal.

Autoscaling: Dynamic GPU Count Based on Traffic

The holy grail for cost optimization. Run 1 GPU during off-peak hours and scale to 4 during peak hours. This requires: an autoscaler that monitors request queue depth or response latency, the ability to provision new GPU instances in seconds to minutes, and a load balancer that routes traffic to new instances as they come online. The challenge is cold start time. Spinning up a new GPU instance and loading a 7B model takes 30-90 seconds on most providers. During that time, your existing GPUs are overloaded and response latency degrades. Strategies to mitigate this include keeping a "warm pool" of pre-provisioned instances (defeats the cost savings) or using a serverless platform like RunPod Serverless that maintains warm instances for you (adds a cost premium). Autoscaling delivers the best ROI for traffic patterns with clear peaks and valleys — think B2B products with daytime usage and minimal nighttime traffic.
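The core of a queue-depth autoscaler is a small decision function. The sketch below is one possible version under simple assumptions: it sizes the fleet to cover in-flight plus queued requests and clamps the result between a minimum and maximum. The names and the 1-to-4 bounds are illustrative, and a real autoscaler would also add a cooldown so it does not flap during the 30-90 second cold start.

```python
import math


def desired_gpu_count(in_flight, queued, per_gpu_capacity, min_gpus=1, max_gpus=4):
    """Number of GPUs needed to cover current demand, clamped to [min_gpus, max_gpus]."""
    demand = in_flight + queued
    needed = math.ceil(demand / per_gpu_capacity)
    return max(min_gpus, min(max_gpus, needed))


# Off-peak: a handful of requests, nothing queued
print(desired_gpu_count(in_flight=3, queued=0, per_gpu_capacity=10))    # 1
# Peak: GPUs saturated and a queue building up
print(desired_gpu_count(in_flight=25, queued=10, per_gpu_capacity=10))  # 4
```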

The Serving Framework Decision: vLLM vs TGI vs Triton

Your choice of serving framework has a bigger impact on cost than your choice of GPU. The right framework can serve 5-10x more concurrent users on the same hardware. Here's the honest comparison.

vLLM: The Default Choice

vLLM is the current standard for LLM serving, and for good reason. Its PagedAttention algorithm manages KV cache memory like an operating system manages virtual memory — dynamically allocating and freeing memory blocks as requests come and go. This means no VRAM is wasted on pre-allocated KV cache slots. Combined with continuous batching (processing new requests as they arrive rather than waiting for the current batch to complete), vLLM typically achieves 3-5x higher throughput than naive serving approaches. Setup is simple: pip install vllm and run the OpenAI-compatible API server with one command. It supports streaming, beam search, quantized models (GPTQ, AWQ, bitsandbytes), and most Hugging Face model architectures out of the box. Use vLLM unless you have a specific reason not to.
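As a rough illustration of the one-command startup, the sketch below assembles a `vllm serve` invocation in Python so it can be logged or handed to a process supervisor. The flag names reflect my understanding of vLLM's CLI, so confirm them with `vllm serve --help` for your installed version. The model name is a placeholder, and `build_vllm_command` is a hypothetical helper, not part of vLLM.

```python
import shlex


def build_vllm_command(model, max_model_len=4096, gpu_memory_utilization=0.90,
                       port=8000, quantization=None):
    """Return the argv list for launching vLLM's OpenAI-compatible server."""
    cmd = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--port", str(port),
    ]
    if quantization:  # e.g. "awq" or "gptq" for a pre-quantized checkpoint
        cmd += ["--quantization", quantization]
    return cmd


# "your-org/your-7b-model" is a placeholder, not a real model ID
cmd = build_vllm_command("your-org/your-7b-model", quantization="awq")
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Building the command as a list instead of a shell string avoids quoting problems when a supervisor or container entrypoint runs it.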

TGI (Text Generation Inference): The Hugging Face Ecosystem

Hugging Face's Text Generation Inference is a Rust-based serving framework that offers similar performance to vLLM with tighter Hugging Face integration. It supports continuous batching, quantization, flash attention, and streaming. TGI shines if you're already deeply embedded in the Hugging Face ecosystem and want seamless integration with their model hub, tokenizers, and inference endpoints. Performance is competitive with vLLM — within 10-15% on most benchmarks. The main downside is that it's slightly less flexible than vLLM for custom models and non-standard architectures.

Triton Inference Server: The Multi-Model Production Server

NVIDIA's Triton Inference Server is designed for production environments serving multiple models simultaneously. It supports model versioning, A/B testing, dynamic batching, and GPU sharing across models. If you're serving an LLM alongside an embedding model, a reranker, and an image model on the same GPU, Triton's model management capabilities are unmatched. The tradeoff: it's significantly more complex to set up and configure than vLLM or TGI. For single-model LLM serving, the added complexity isn't worth it. For multi-model production systems, it's the right tool.

Batching Is Everything

The single most impactful optimization for LLM serving throughput is batching — processing multiple requests simultaneously on the same GPU. Without batching, a 7B model on an H100 generates about 60-80 tokens/sec for a single user. That means 1 concurrent user. With continuous batching enabled in vLLM, the same GPU serves 40-60 concurrent users at 30-40 tokens/sec each, for a total throughput of 1,200-2,000 tokens/sec. That's a 15-30x increase in throughput for free — same GPU, same model, just better scheduling.

Why does batching help so much? LLM inference is memory-bandwidth-bound, not compute-bound. The GPU spends most of its time reading model weights from VRAM, and the compute units are mostly idle. When you batch multiple requests, the model weights are read once from VRAM and used for all requests in the batch. The compute units — which were previously idle — now do useful work. The memory bandwidth cost is amortized across all batched requests. This is why continuous batching (vLLM's approach) is so effective: it fills every GPU cycle with useful work instead of letting the GPU sit idle between requests.
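A back-of-the-envelope roofline model shows why the amortization works. In the sketch below, each decode step takes the longer of two times: reading the weights once (shared by the whole batch) or doing the math for every request in the batch. The hardware numbers are round, roughly H100-class assumptions, not measurements, and the model ignores KV cache reads and scheduling overhead, so its outputs are idealized ceilings well above the real-world figures quoted in this guide.

```python
def decode_step_seconds(batch_size, weight_bytes, mem_bandwidth,
                        flops_per_token, peak_flops):
    """Idealized time for one decode step that emits one token per request."""
    memory_time = weight_bytes / mem_bandwidth               # weights read once per step
    compute_time = batch_size * flops_per_token / peak_flops  # math grows with the batch
    return max(memory_time, compute_time)


def total_tokens_per_sec(batch_size, **hw):
    return batch_size / decode_step_seconds(batch_size, **hw)


# Assumed round numbers for illustration only
hw = dict(
    weight_bytes=14e9,      # 7B parameters x 2 bytes (FP16)
    mem_bandwidth=3.35e12,  # ~3.35 TB/s of memory bandwidth
    flops_per_token=14e9,   # ~2 FLOPs per parameter per generated token
    peak_flops=989e12,      # assumed dense FP16 peak
)

for batch in (1, 8, 32, 64):
    step_ms = decode_step_seconds(batch, **hw) * 1000
    print(f"batch {batch:>2}: step {step_ms:.2f} ms, "
          f"{total_tokens_per_sec(batch, **hw):,.0f} tokens/sec total")
```

In this model the step time stays flat as the batch grows, because the weight read dominates, so total throughput scales almost linearly with batch size until the compute term catches up.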

Quantization for Production: The 70% Cost Reduction

Quantization reduces model weight precision from FP16 (16 bits per weight) to INT8 (8 bits) or INT4 (4 bits), cutting VRAM usage by 2-4x. This lets you fit larger models on cheaper GPUs, or serve more concurrent requests on the same GPU by freeing VRAM for KV cache.

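| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM | INT4 Quality Loss |
| --- | --- | --- | --- | --- |
| 7B | ~14GB | ~7GB | ~4GB | <3% |
| 13B | ~26GB | ~13GB | ~7GB | <2% |
| 30B | ~60GB | ~30GB | ~16GB | <2% |
| 70B | ~140GB | ~70GB | ~38GB | <2% |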

The practical impact: a 13B model quantized to INT4 (~7GB) fits on an RTX 4090 (24GB) instead of requiring an L40S (48GB). That changes your hourly cost from $0.88 to $0.39, a 56% reduction, with less than 2% quality degradation on standard benchmarks. For production systems where cost matters more than achieving the absolute maximum quality, quantization is a no-brainer. Use AWQ or GPTQ when you can quantize ahead of time and load a pre-quantized checkpoint (fastest inference), or bitsandbytes to quantize at load time (easier setup, slightly slower).
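The VRAM columns follow from parameter count times bits per weight. The sketch below computes the raw weight footprint; the INT4 values in the table run slightly higher than this formula gives, which I take to reflect quantization metadata such as scales and zero points. The helper name is mine.

```python
def weight_vram_gb(params_billions, bits_per_weight):
    """Raw weight footprint in GB (1e9 bytes), ignoring KV cache and metadata."""
    return params_billions * bits_per_weight / 8


for size in (7, 13, 30, 70):
    fp16, int8, int4 = (weight_vram_gb(size, bits) for bits in (16, 8, 4))
    print(f"{size}B: FP16 {fp16:.1f}GB, INT8 {int8:.1f}GB, INT4 {int4:.1f}GB")

# Hourly saving from moving a 13B model off an L40S onto an RTX 4090
saving = (0.88 - 0.39) / 0.88
print(f"{saving:.0%}")  # 56%
```

Whatever VRAM the weights do not use is available for KV cache, which is what sets how many concurrent requests fit.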

RunPod Serverless vs Self-Managed: When to Use Each

RunPod Serverless charges only for the seconds your GPU is actively processing requests. When there's no traffic, there's no charge (or a minimal idle fee if you keep workers warm). This is ideal for bursty traffic — products that get 100 requests in one hour and zero in the next. You're not paying for the dead time.

Self-managed (running your own GPU instance 24/7) is cheaper when your utilization exceeds roughly 40-50%. If your GPU is actively processing requests for close to half the time or more, the serverless bill, with its per-second premium, exceeds the cost of just keeping a GPU running constantly. For steady-traffic products with consistent daily usage patterns, self-managed wins. For early-stage products with unpredictable traffic, serverless wins.

The crossover point depends on your traffic level. For 100 DAU (very low traffic), RunPod Serverless costs roughly $15/month — dramatically cheaper than a self-managed GPU at $281/month because the GPU would be idle 99%+ of the time. For 10,000 DAU, self-managed at $281/month beats serverless at ~$1,500/month because utilization is high enough to justify a dedicated GPU.
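Using only the figures in this guide, you can estimate where the two options break even. The sketch assumes serverless cost scales linearly with DAU at the rate implied by "$15/month for 100 DAU", and that a single RTX 4090 at $281/month covers the load, which holds up to roughly 10,000 DAU in this scenario. Both are simplifications, and the function names are hypothetical.

```python
SERVERLESS_PER_DAU = 15 / 100  # $ per DAU per month, implied by ~$15 for 100 DAU
SELF_MANAGED_MONTHLY = 281     # one RTX 4090 running 24/7


def serverless_monthly(dau):
    return dau * SERVERLESS_PER_DAU


def cheaper_option(dau):
    if serverless_monthly(dau) < SELF_MANAGED_MONTHLY:
        return "serverless"
    return "self-managed"


crossover_dau = SELF_MANAGED_MONTHLY / SERVERLESS_PER_DAU
print(f"break-even at about {crossover_dau:,.0f} DAU")

for dau in (100, 1_000, 10_000):
    print(dau, cheaper_option(dau), f"${serverless_monthly(dau):,.0f} vs ${SELF_MANAGED_MONTHLY}")
```

Under these assumptions the break-even lands a little under 2,000 DAU. Rerun it with your own per-request serverless pricing, since that rate drives the result.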

The Production Deployment Checklist

Before you go live, make sure you've addressed each of these points. I've seen teams skip most of these and end up with production fires that cost more than the GPU itself.

  • Use vLLM or TGI with continuous batching enabled. Without it, your throughput drops 5-10x and your cost per user increases proportionally.
  • Quantize to INT8 or INT4 unless you have a proven quality reason not to. The VRAM savings let you use cheaper GPUs or serve more concurrent requests.
  • Set max_model_len in vLLM to your actual maximum context length. The default often allocates KV cache for the model's maximum context (e.g., 128K tokens), wasting VRAM. If your requests never exceed 4K tokens, set it to 4096.
  • Implement health checks and automatic restarts. GPU processes crash. CUDA OOM errors happen. Your serving process should be wrapped in a supervisor (systemd, Docker restart policy, Kubernetes liveness probe) that detects failures and restarts automatically.
  • Monitor GPU utilization, VRAM usage, request latency, and queue depth. If GPU utilization is consistently above 80%, you need another GPU. If it's below 20%, you're overpaying. If queue depth is growing, your throughput is insufficient for your traffic.
  • Set request timeouts. A single malicious or accidental request for 100K tokens can block the GPU for minutes. Set a maximum generation length (e.g., 2048 tokens) and a request timeout (e.g., 60 seconds) to prevent runaway requests.
  • Implement rate limiting per user. Without rate limits, a single user hammering your API can degrade the experience for everyone. Limit concurrent requests and tokens/minute per API key.
  • Plan for GPU failure. If you have one GPU and it fails, your service is down. For production systems, run at least 2 GPUs behind a load balancer so that one can fail without downtime. This costs 2x but provides redundancy.
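Two of the checklist items, the generation cap and per-user rate limiting, are small enough to sketch. Below is a token-bucket limiter (one bucket per API key) and a clamp on requested output length. This is one common approach, not the only one. The class and function names are mine, and the 2048-token cap is the example value from the checklist.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = capacity
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill based on elapsed time, never beyond capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


MAX_GENERATION_TOKENS = 2048  # example cap from the checklist


def clamp_max_tokens(requested):
    """Bound a client's requested output length to the server-side maximum."""
    return max(1, min(requested, MAX_GENERATION_TOKENS))


buckets = {}  # one bucket per API key


def admit(api_key):
    bucket = buckets.setdefault(api_key, TokenBucket(rate_per_sec=1, capacity=5))
    return bucket.allow()


print([admit("key-123") for _ in range(6)])  # the sixth back-to-back call is rejected
print(clamp_max_tokens(100_000))             # 2048
```

Run the clamp before the request reaches the serving framework, and pair it with a wall-clock timeout in your API layer so a stuck request cannot hold a slot indefinitely.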

The Bottom Line: Matching Architecture to Traffic

Most teams serving LLMs to production are overpaying because they chose their GPU and architecture based on vibes rather than math. The actual decision process should be:

  • Under 1,000 DAU: Use RunPod Serverless. Don't manage any GPUs. Pay per request. Monthly cost: $15-150.
  • 1,000-10,000 DAU: Self-managed single RTX 4090 with vLLM. Quantize to INT4 if it helps. Monthly cost: $281.
  • 10,000-50,000 DAU: 2-3 RTX 4090s behind a load balancer for both throughput and redundancy. Monthly cost: $562-843.
  • 50,000-100,000 DAU: 3-5 RTX 4090s or 1-2 H100s with autoscaling. Monthly cost: $843-2,692.
  • 100,000+ DAU: Multi-GPU horizontal scaling with autoscaling, health checks, and multi-provider redundancy. Hire an infra engineer. Monthly cost: $1,500+.

Notice that even at 100,000 DAU, the monthly GPU cost for a 7B model is under $3,000 with the right architecture. Teams paying $10,000+/month for the same workload are choosing the wrong GPUs, the wrong providers, or the wrong serving frameworks. Use our GPU price comparison tool to find the cheapest GPU that meets your VRAM and throughput requirements, then apply the architecture patterns in this guide to minimize your monthly bill.
