The average cloud GPU is idle roughly 73% of the time it is being billed. That is not a guess; it is based on utilization data from GPU cloud providers and infrastructure monitoring tools. If you are paying for a GPU 24/7 but only running inference or training jobs during work hours, you are burning roughly 73 cents of every dollar. Here is how to stop.
## Where the 73% Comes From
The typical GPU utilization pattern breaks down like this:
| Time Period | Hours/Week | GPU Util % | Effective Hours |
|---|---|---|---|
| Active training/inference (business hours) | 40 | 65% | 26 |
| Idle during business hours | 40 | 35% | 14 |
| Nights + weekends (GPU still running) | 128 | 5% | 6.4 |
| Total | 168 | 27.6% | 46.4 |
You are paying for 168 hours but getting only 46.4 effective GPU-hours, which works out to 72.4% waste (the rounded "73%" in the headline). On an H100 at $1.87/hr, that is roughly $227/week going straight to the cloud provider for zero compute.
## Fix #1: Stop Paying for GPUs When You Sleep
The single biggest waste: leaving GPUs running 24/7 when you only work 8-10 hours/day. Here is the math on an H100:
- 24/7: $1.87 * 730 hrs/mo = $1,365/mo
- 10hrs/day, weekdays: $1.87 * 220 hrs/mo = $411/mo
- Savings: $954/mo (70%)
How to implement:
- On RunPod, stop pods when not in use; you only pay for network volume storage ($0.07/GB/mo).
- On Lambda Labs, terminate instances and re-create from a snapshot.
- On AWS, stop the instance (EBS persists at ~$0.08/GB/mo).
Set up a cron job or use your CI/CD pipeline to auto-stop GPUs at EOD.
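The scheduling logic itself is trivial; the provider call is the only non-portable part. A minimal sketch of the decision logic, where `stop_instance` is a placeholder for your provider's stop call (the hours and the callback name are assumptions, not any provider's API):

```python
from datetime import datetime

WORK_START, WORK_END = 8, 18  # business hours: 10 hrs/day (assumed schedule)

def should_run(now: datetime) -> bool:
    """True only on weekdays between WORK_START and WORK_END."""
    is_weekday = now.weekday() < 5  # Mon=0 .. Fri=4
    return is_weekday and WORK_START <= now.hour < WORK_END

def enforce_schedule(now: datetime, stop_instance) -> bool:
    """Issue a stop outside work hours.

    `stop_instance` is a placeholder for e.g. an `aws ec2 stop-instances`
    invocation, a RunPod pod-stop call, or a Lambda Labs terminate call.
    Returns True if a stop was issued.
    """
    if not should_run(now):
        stop_instance()
        return True
    return False

# Run this from cron every 15 minutes, e.g.:
#   */15 * * * * /usr/bin/python3 /opt/gpu/auto_stop.py
```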
## Fix #2: Right-Size Your GPU
If your GPU utilization during active hours is under 50%, you are running on a GPU that is too big. Common mistakes:
- Using an H100 for 7B inference — an RTX 4090 at $0.39/hr does the same job 4.8x cheaper
- Using an A100 80GB for a model that uses 20GB VRAM — an L40S at $0.69/hr has enough VRAM and costs 37% less
- Using 4x GPUs when 1x is enough — multi-GPU adds overhead and most inference workloads are single-GPU
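Right-sizing can be reduced to a lookup: given the VRAM a model actually needs, pick the cheapest single card that fits. A sketch using this article's illustrative prices (a real picker would also account for FP8 hardware support, interconnect, and KV-cache headroom):

```python
# (name, VRAM in GB, $/hr) -- illustrative prices from this article;
# the H100 rate varies by provider.
GPUS = [
    ("RTX 4090", 24, 0.39),
    ("L40S", 48, 0.69),
    ("A100 80GB", 80, 1.10),
    ("H100 80GB", 80, 1.87),
]

def right_size(vram_needed_gb: float) -> tuple[str, float]:
    """Return (name, $/hr) of the cheapest single GPU with enough VRAM."""
    candidates = [g for g in GPUS if g[1] >= vram_needed_gb]
    if not candidates:
        raise ValueError("no single GPU fits; consider quantizing or multi-GPU")
    cheapest = min(candidates, key=lambda g: g[2])
    return (cheapest[0], cheapest[2])

print(right_size(35))  # 4-bit 70B -> ('L40S', 0.69)
```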
## Fix #3: Quantize Before You Rent
Quantization is the single highest-ROI optimization in GPU computing. The impact:
| Model (70B) | Precision | VRAM Needed | Cheapest GPU | Price/hr |
|---|---|---|---|---|
| Llama 3 70B | FP16 | ~140GB | 2x H100 80GB | $2.58+ |
| Llama 3 70B | FP8 | ~70GB | 1x H100 80GB | $1.29 |
| Llama 3 70B | GPTQ 4-bit | ~35GB | 1x L40S 48GB | $0.69 |
Going from FP16 to GPTQ 4-bit on a 70B model drops your GPU cost from $2.58/hr to $0.69/hr — a 73% cost reduction with less than 2% quality loss on most benchmarks. This is free money.
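The VRAM column follows directly from bytes per parameter. A rough weights-only estimate (KV cache and activations come on top, so real deployments need headroom beyond these numbers):

```python
def weights_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Weights-only VRAM: billions of params * bytes per param = GB.

    Excludes KV cache and activations, which are workload-dependent.
    """
    return params_billions * bits_per_param / 8

for name, bits in [("FP16", 16), ("FP8", 8), ("GPTQ 4-bit", 4)]:
    print(f"{name}: ~{weights_vram_gb(70, bits):.0f} GB")
# FP16: ~140 GB, FP8: ~70 GB, GPTQ 4-bit: ~35 GB
```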
## Fix #4: Use Spot Instances for Fault-Tolerant Work
If your workload can handle interruptions — batch inference, training with checkpoints, evaluation runs — spot instances cut costs 40-70%:
- H100 on-demand: $1.87/hr → H100 spot: $0.73/hr (-61%)
- A100 on-demand: $1.10/hr → A100 spot: $0.34/hr (-69%)
- RTX 4090 on-demand: $0.39/hr → RTX 4090 spot: $0.19/hr (-51%)
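Making a workload spot-safe mostly means checkpointing often enough that a preemption is cheap. A minimal sketch of the resume logic, using a hypothetical `checkpoint.json` file (real training code would persist model and optimizer state, e.g. with `torch.save`, rather than a step counter):

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint path

def load_checkpoint() -> int:
    """Resume from the last completed step, or start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step: int) -> None:
    # Write-then-rename so a preemption mid-write can't corrupt the file.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT)

def train(total_steps: int, ckpt_every: int = 100) -> int:
    """Run (or resume) training; a spot preemption loses at most
    `ckpt_every` steps of work."""
    step = load_checkpoint()
    while step < total_steps:
        step += 1  # one training step (placeholder for the real work)
        if step % ckpt_every == 0:
            save_checkpoint(step)
    save_checkpoint(step)
    return step
```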
## Fix #5: Batch Your Requests
If you are running inference one request at a time, you are using maybe 10-20% of the GPU's capacity. vLLM, TGI, and other serving frameworks support continuous batching. The impact is dramatic:
- Batch=1 on H100: ~105 tok/s → $4.95/1M tokens
- Batch=8 on H100: ~600 tok/s → $0.87/1M tokens
- Batch=32 on H100: ~1,800 tok/s → $0.29/1M tokens
Going from batch=1 to batch=32 is a 17x cost reduction for the same GPU. If you have enough concurrent users or can queue requests, this is the single most impactful optimization.
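Cost per token is just the hourly rate divided by hourly throughput, so the batching numbers above fall out of one formula:

```python
def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """$/1M tokens for a GPU at a given hourly price and throughput."""
    tokens_per_hr = tokens_per_sec * 3600
    return price_per_hr / tokens_per_hr * 1_000_000

# H100 at $1.87/hr, throughputs from the list above.
for batch, tps in [(1, 105), (8, 600), (32, 1800)]:
    print(f"batch={batch}: ${cost_per_million_tokens(1.87, tps):.2f}/1M tokens")
# batch=1: $4.95, batch=8: $0.87, batch=32: $0.29
```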
## The Combined Impact
Let's combine all five fixes for a 70B inference workload:
| Optimization | Monthly Cost | Savings |
|---|---|---|
| Baseline: 2x H100 at $1.87/hr each, FP16, 24/7, batch=1 | $2,730 | — |
| + Quantize to 4-bit (1x L40S) | $504 | -82% |
| + Run only 10hrs/day weekdays | $152 | -94% |
| + Use spot instance | $57 | -98% |
From $2,730/month to $57/month. A 98% reduction. That is the difference between "GPU costs are killing us" and "GPU costs are a rounding error."
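The stacked savings are just rates times hours. A sketch using the article's figures (the $0.26/hr L40S spot rate is inferred from the $57/month line and is an assumption, not a quoted price):

```python
H100_RATE, L40S_RATE = 1.87, 0.69   # $/hr, on-demand
L40S_SPOT = 0.26                    # $/hr, assumed spot rate
HOURS_247, HOURS_WORK = 730, 220    # hours per month

baseline = 2 * H100_RATE * HOURS_247   # 2x H100, FP16, 24/7
quantized = L40S_RATE * HOURS_247      # 4-bit on 1x L40S, still 24/7
scheduled = L40S_RATE * HOURS_WORK     # only 10 hrs/day on weekdays
spot = L40S_SPOT * HOURS_WORK          # same schedule, spot pricing

for label, cost in [("baseline", baseline), ("+ quantize", quantized),
                    ("+ schedule", scheduled), ("+ spot", spot)]:
    print(f"{label:>11}: ${cost:,.0f}/mo ({1 - cost / baseline:.0%} saved)")
# baseline $2,730 -> quantize $504 -> schedule $152 -> spot $57
```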
Find the right GPU for your optimized workload: Use our GPU price comparison to find the cheapest spot instances and right-sized GPUs across all providers.