Every GPU cloud provider advertises a clean hourly price. $1.87/hr for an H100. $0.39/hr for an RTX 4090. You multiply by the hours you need, budget accordingly, and then your actual bill arrives 2-3x higher than you projected. This isn't a billing error. It's the hidden costs that every provider documents deep in its pricing pages and none of them puts on a landing page. After helping dozens of teams audit their GPU spend, I can tell you that the hourly rate typically accounts for only 40-60% of your real monthly cost. The rest is storage, data transfer, idle time, and a handful of gotchas that catch even experienced engineers off guard.
This guide breaks down every line item that shows up on a real GPU cloud bill, gives you the formulas to calculate your actual monthly cost before you spin up a single instance, and shows you how to cut 30-70% off your bill with straightforward operational changes. No fluff, just the math.
The Hidden Cost Breakdown
Your GPU cloud bill has five major components. Most people only budget for the first one.
1. Compute (The Advertised Price)
This is the number everyone knows: the per-hour cost of the GPU instance. It's what you see on pricing pages and comparison tools. For an H100 80GB, this ranges from $1.87/hr on Cudo Compute to $8.46/hr on AWS p5 instances. For an RTX 4090, it ranges from $0.34/hr on RunPod to $0.74/hr on Lambda. But this number is only meaningful at 100% GPU utilization, and your utilization is never 100%.
2. Storage ($0.08–0.20/GB/month)
Cloud providers charge for persistent storage whether your instance is running or not. SSD-backed block storage typically runs $0.08–0.20/GB/month depending on the provider and storage tier. This sounds cheap until you realize what a typical ML workflow actually stores: your training dataset (50-500GB), model checkpoints saved every N steps (each checkpoint for a 7B model is ~14GB in FP16, and you might keep 10-20 of them), the base model weights (another 14-140GB depending on model size), plus your code, logs, and virtual environment. A realistic storage footprint for a 7B model fine-tuning project is 200-500GB, costing $16-100/month. For 70B model training with frequent checkpointing, you can easily hit 2-5TB, adding $160-1,000/month to your bill.
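To sanity-check your own footprint before you provision a volume, the arithmetic is simple: an FP16 checkpoint is roughly two bytes per parameter, and everything else is addition. Here's a minimal sketch; the dataset size, checkpoint count, and $0.10/GB/month rate are illustrative assumptions, not any provider's actual pricing.

```python
# Rough storage-cost estimate for a fine-tuning project.
# All inputs are illustrative; substitute your own numbers and rates.

def checkpoint_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """FP16 is ~2 bytes/param, so a 7B checkpoint is ~14GB.
    Add optimizer state separately if your checkpoints include it."""
    return params_billion * bytes_per_param

def monthly_storage_cost(dataset_gb, n_checkpoints, params_billion,
                         base_weights_gb, misc_gb=10, price_per_gb_month=0.10):
    total_gb = (dataset_gb
                + n_checkpoints * checkpoint_gb(params_billion)
                + base_weights_gb
                + misc_gb)  # code, logs, venv
    return total_gb, total_gb * price_per_gb_month

gb, cost = monthly_storage_cost(dataset_gb=100, n_checkpoints=15,
                                params_billion=7, base_weights_gb=14)
print(f"~{gb:.0f}GB stored -> ~${cost:.2f}/month")  # ~334GB -> ~$33.40/month
```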
3. Data Egress ($0.08–0.12/GB)
Uploading data to the cloud is usually free. Downloading it costs money. AWS charges $0.09/GB for data egress, GCP charges $0.12/GB, and Azure charges $0.087/GB. Marketplace providers like RunPod and Vast.ai typically include some egress in their pricing or charge reduced rates. During a typical training run, you'll download: final model weights (14-140GB), training logs and metrics (1-5GB), evaluation results and generated samples (1-10GB), and possibly intermediate checkpoints. For a 7B model fine-tune, that's roughly 20-50GB of egress, costing $1.60-6.00 per run. For a 70B model or a run with many checkpoints, egress can hit 200-500GB, costing $16-60.
4. Idle Time (The Silent Killer)
This is the single biggest hidden cost, and it's entirely behavioral. You spin up an H100 instance at 9am to start a training run. The data needs preprocessing first: 45 minutes of CPU work while the GPU sits idle. Training runs for 4 hours. You check the results and tweak hyperparameters, another 30 minutes of idle GPU. You run training again for 3 hours, review the results over a late lunch, and wind down for the day. By 6pm the instance has been running for 9 hours, but the GPU was actually training for only 7. That's 22% idle time, and you paid for every second of it. In practice, most teams see 40-60% GPU utilization on interactive development instances. The GPU is idle during data loading, preprocessing, debugging, code changes, meetings, lunch breaks, and the gap between "training finished" and "I noticed and started the next run." At $1.87/hr for an H100, 40% idle time means you're paying $0.75/hr for nothing.
5. Billed-When-Stopped Fees
Here's a gotcha that catches people: on AWS, GCP, and Azure, stopping a GPU instance doesn't stop all charges. You stop paying for the compute, but you keep paying for attached EBS/Persistent Disk volumes, elastic IPs, and any reserved capacity. If you have a 1TB EBS gp3 volume attached to a stopped p5 instance, you're paying about $80/month for storage alone even though the instance isn't running. Some marketplace providers handle this better — RunPod charges for storage on stopped pods at a reduced rate, and Vast.ai only charges when your instance is running (but your data is wiped when you terminate). Know your provider's policy before you assume "stopped" means "free."
Worked Example: "I Need an H100 for 8 Hours of Training"
Let's walk through a realistic scenario. You need to fine-tune a 13B model, and you estimate 8 hours of GPU training time on an H100.
| Cost Item | Naive Estimate | Realistic Cost |
|---|---|---|
| GPU compute (8 hrs @ $1.87/hr) | $14.96 | — |
| Actual instance uptime (12 hrs total @ $1.87/hr, incl. idle and setup) | — | $22.44 |
| 200GB SSD storage (1 month @ $0.08/GB) | $0 | $16.00 |
| 50GB data egress @ $0.09/GB | $0 | $4.50 |
| Data upload (100GB, free inbound) | $0 | $0 |
| Total | $14.96 | $42.94 |
The real cost is 2.87x the naive estimate. And this is a conservative scenario. If you keep your storage volume attached for a full month, forget to shut down your instance overnight once, or need to re-run training after a bug, the multiplier climbs to 3-4x easily. Your mental model of GPU costs needs to include everything around the GPU, not just the GPU itself.
The Real Cost Formula
Here's the formula I use to estimate actual monthly GPU costs. It's not perfect, but it gets you within 15% of your real bill, which is a lot better than the naive calculation.
True Monthly Cost = (GPU $/hr x Daily Hours x 30 / Utilization Rate) + (Storage GB x $/GB/mo) + (Monthly Egress GB x $/GB) + Stopped Instance Fees
The utilization rate is the key variable most people ignore. If you're running automated training pipelines with proper scheduling, you might hit 80-90% utilization. If you're doing interactive development — SSH into a box, run experiments manually, iterate — plan for 40-60%. If you're a team sharing a persistent instance, utilization can drop to 20-30% because only one person is typically using the GPU at a time while the others do CPU work.
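Here's that formula as a small script you can adapt. The example inputs describe an interactive H100 setup (4 useful GPU-hours a day at 50% utilization); they're assumptions for illustration, not measurements.

```python
def true_monthly_cost(gpu_hr, useful_hours_per_day, utilization,
                      storage_gb, price_per_gb_month,
                      egress_gb, price_per_egress_gb, stopped_fees=0.0):
    """(GPU $/hr x daily hours x 30 / utilization) + storage + egress + stopped fees."""
    compute = gpu_hr * useful_hours_per_day * 30 / utilization
    storage = storage_gb * price_per_gb_month
    egress = egress_gb * price_per_egress_gb
    return compute + storage + egress + stopped_fees

# Interactive H100 development: 4 useful hours/day at 50% utilization,
# 300GB of SSD storage at $0.10/GB/month, 100GB egress at $0.09/GB.
cost = true_monthly_cost(1.87, 4, 0.50, 300, 0.10, 100, 0.09)
print(f"${cost:,.2f}/month")  # $487.80, vs the naive 4 x 30 x $1.87 = $224.40
```

Notice that the utilization term alone more than doubles the compute line; the naive estimate misses that entirely.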
Monthly Cost Comparison for Real Workloads
Here's what different workloads actually cost per month across different providers, including all hidden costs. These numbers assume realistic utilization rates and include storage and egress estimates.
| Workload | RunPod | Latitude | Cudo | AWS |
|---|---|---|---|---|
| Dev/testing (4 hrs/day, RTX 4090) | ~$56/mo | ~$68/mo | ~$62/mo | ~$145/mo |
| Inference server (24/7, L40S) | ~$680/mo | ~$633/mo | ~$710/mo | ~$2,200/mo |
| Training (8 hrs/day, H100) | ~$520/mo | ~$580/mo | ~$448/mo | ~$2,030/mo |
The AWS column includes storage and egress at standard rates. The marketplace provider columns assume minimal storage fees and reduced or included egress. The difference is staggering: for the training workload, you're paying about 4.5x as much on AWS as on Cudo Compute for the same H100 GPU. Use our TMC (Total Monthly Cost) panel to run these numbers for your specific workload.
10 Ways to Cut Your GPU Bill by 30-70%
1. Use Spot Instances for Fault-Tolerant Work
Spot/preemptible instances save 50-73% on compute costs. H100 spot instances start at $0.73/hr vs $1.87/hr on-demand. The catch: your instance can be terminated with little or no warning. This is fine for training jobs with checkpointing — you lose at most the work since your last checkpoint. It's not fine for inference servers that need to be always-on. Set up your training script to save checkpoints every 15-30 minutes and auto-resume from the latest checkpoint on startup.
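A minimal shape for that pattern, assuming a PyTorch-style training loop; the checkpoint directory, file naming, and 20-minute cadence are illustrative choices, not anyone's required API.

```python
import glob
import os
import time

import torch

CHECKPOINT_DIR = "/workspace/checkpoints"  # a volume that survives preemption
SAVE_EVERY_SECS = 20 * 60                  # checkpoint every ~20 minutes

def latest_checkpoint():
    ckpts = sorted(glob.glob(os.path.join(CHECKPOINT_DIR, "step_*.pt")))
    return ckpts[-1] if ckpts else None

def train(model, optimizer, data_loader, total_steps):
    start_step = 0
    ckpt = latest_checkpoint()
    if ckpt:  # instance was preempted: resume instead of restarting
        state = torch.load(ckpt)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_step = state["step"]
    last_save = time.time()
    # Note: this resumes the step counter only; a production loop would
    # also checkpoint the dataloader/sampler state to avoid repeating data.
    for step, batch in enumerate(data_loader, start=start_step):
        ...  # forward/backward/optimizer.step() as usual
        if time.time() - last_save > SAVE_EVERY_SECS:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step},
                       os.path.join(CHECKPOINT_DIR, f"step_{step:08d}.pt"))
            last_save = time.time()
        if step >= total_steps:
            break
```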
2. Script Your Shutdowns
The simplest optimization: add a shutdown command at the end of your training script. After training completes, the instance terminates automatically. No more paying for 8 hours of idle time overnight because you forgot to shut it down. On Linux, append sudo shutdown -h now to the end of your training script, or use the provider's API to terminate the instance programmatically. RunPod's API lets you stop pods with a single REST call. For extra safety, set a cron job that checks GPU utilization every 10 minutes and shuts down the instance if it has been idle for 30+ minutes.
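A sketch of that watchdog in Python, intended to run from cron every 10 minutes as root; the 5% threshold and 30-minute window are the values from above, and the state-file path is an arbitrary choice.

```python
#!/usr/bin/env python3
# Idle-GPU watchdog. Example crontab entry (run as root so shutdown works):
#   */10 * * * * /usr/local/bin/gpu_watchdog.py
import os
import subprocess
import time

STATE_FILE = "/tmp/gpu_last_busy"  # timestamp of the last busy observation
IDLE_THRESHOLD_PCT = 5
IDLE_WINDOW_SECS = 30 * 60

def gpu_utilization() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return max(int(line) for line in out.strip().splitlines())

now = time.time()
if gpu_utilization() > IDLE_THRESHOLD_PCT or not os.path.exists(STATE_FILE):
    with open(STATE_FILE, "w") as f:
        f.write(str(now))  # busy (or first run): reset the idle clock
elif now - float(open(STATE_FILE).read()) > IDLE_WINDOW_SECS:
    subprocess.run(["shutdown", "-h", "now"])  # idle 30+ min: stop paying
```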
3. Per-Second vs Per-Hour Billing
AWS charges per-second with a 60-second minimum. GCP charges per-second with a 1-minute minimum. RunPod and Vast.ai charge per-second with no minimum. But many providers round up to the nearest hour. If your training job takes 2 hours and 5 minutes on a per-hour provider, you pay for 3 hours. On a per-second provider, you pay for 2 hours and 5 minutes. Over many short jobs, this difference compounds significantly. For iterative development with lots of short runs, per-second billing saves 10-25%.
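The compounding is easy to see with a quick calculation; this assumes the per-hour provider rounds each job up to the next full hour, which is the common case but worth confirming in your provider's billing docs.

```python
import math

def billed_cost(job_minutes: float, rate_per_hr: float, per_second: bool) -> float:
    """Per-second billing charges exact time; per-hour rounds each job up."""
    hours = job_minutes / 60 if per_second else math.ceil(job_minutes / 60)
    return hours * rate_per_hr

# Twenty 3h10m training runs on an H100 at $1.87/hr:
jobs, minutes = 20, 190
print(f"per-hour:   ${jobs * billed_cost(minutes, 1.87, False):.2f}")  # $149.60
print(f"per-second: ${jobs * billed_cost(minutes, 1.87, True):.2f}")   # $118.43
# ~21% saved purely from billing granularity
```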
4. Store Data on Object Storage, Not Attached Volumes
S3, R2, and GCS cost $0.02-0.03/GB/month, roughly a quarter to a tenth the price of SSD block storage. Keep your training datasets and final model weights on object storage. Only use attached SSD volumes for active training (checkpoints, cache, temporary files). When training finishes, push results to object storage and terminate the instance. Cloudflare R2 is particularly attractive because it has zero egress fees, which eliminates the data transfer cost entirely. The tradeoff is latency: reading from S3 is slower than local SSD, so copy data to local storage at the start of training, not during.
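The workflow looks like this with boto3; the bucket and paths are placeholders, and because R2 and GCS expose S3-compatible endpoints, the same client works against them with an endpoint_url.

```python
import boto3

s3 = boto3.client("s3")        # for R2/GCS, pass endpoint_url=... here
BUCKET = "my-training-data"    # placeholder bucket name

# 1. At training start: copy the dataset to fast local NVMe, once.
s3.download_file(BUCKET, "datasets/train.parquet", "/local/train.parquet")

# 2. Train against local disk; keep checkpoints and cache on the attached SSD.

# 3. At training end: push results to object storage, then terminate.
s3.upload_file("/local/final_model.safetensors", BUCKET,
               "runs/run-042/final_model.safetensors")
```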
5. Checkpoint Frequently on Spot Instances
If you're using spot instances and your instance gets preempted after 6 hours of training without a checkpoint, you've lost 6 hours of compute and the money you spent on it. Checkpointing every 15-30 minutes means you lose at most 30 minutes of work. Save checkpoints to object storage (not local disk) so they survive instance termination. The cost of checkpoint storage is negligible compared to the cost of re-running lost training.
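Gluing the last two sketches together, a checkpoint hook can write locally for speed and immediately mirror to object storage so the file outlives the instance; the names are again placeholders.

```python
import os

import boto3
import torch

s3 = boto3.client("s3")

def save_checkpoint(state: dict, step: int,
                    local_dir="/workspace/checkpoints",
                    bucket="my-training-data", prefix="runs/run-042"):
    path = os.path.join(local_dir, f"step_{step:08d}.pt")
    torch.save(state, path)  # fast local write first
    s3.upload_file(path, bucket, f"{prefix}/step_{step:08d}.pt")
    # the S3 copy survives spot preemption even if the local disk is wiped
```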
6. Right-Size Your GPU
Running a 7B model inference server on an H100 is like driving a semi truck to get groceries. An RTX 4090 at $0.34/hr handles 7B inference perfectly at less than a fifth the cost of an H100 at $1.87/hr. Before you spin up the most powerful GPU, check if your workload actually needs it. Run a quick benchmark on a cheaper GPU first. If a 7B model fits in 24GB VRAM and your throughput requirements are modest, the RTX 4090 is the right choice. Use our comparison tool to find the cheapest GPU that meets your VRAM requirements.
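A crude fit check before you rent anything, sketched below; the 1.2x overhead factor for activations and KV cache at modest batch sizes is a rough rule of thumb I'm assuming, not a guarantee, so benchmark before committing.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def fits(params_billion: float, dtype: str, vram_gb: float,
         overhead: float = 1.2) -> bool:
    """Rough check: weight memory x ~1.2 for KV cache/activations."""
    needed_gb = params_billion * BYTES_PER_PARAM[dtype] * overhead
    return needed_gb <= vram_gb

print(fits(7, "fp16", 24))   # True:  7B FP16 fits a 24GB RTX 4090
print(fits(13, "fp16", 24))  # False: 13B FP16 needs a bigger card...
print(fits(13, "int4", 24))  # True:  ...or 4-bit quantization (next tip)
```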
7. Use Quantization to Fit on Cheaper GPUs
A 13B model in FP16 needs ~26GB of VRAM and requires an L40S ($0.88/hr) or A100 ($1.10/hr). The same model quantized to INT4 needs ~7GB and fits on an RTX 4090 ($0.34/hr) with room to spare. Quality loss from 4-bit quantization is typically under 3% on standard benchmarks for models 7B and larger. That's a 70% cost reduction for a barely measurable quality difference. Tools like GPTQ, AWQ, and bitsandbytes make quantization trivial — it takes minutes to quantize a model.
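As a concrete example, loading a 13B model in 4-bit via transformers and bitsandbytes takes a few lines; the model ID is a placeholder for whatever checkpoint you're serving, and argument names can shift between library versions, so check the docs for your installed versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder: any 13B causal LM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,     # weights land in roughly 7GB of VRAM
    device_map="auto",
)
```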
8. Avoid Multi-GPU When Single-GPU Suffices
Multi-GPU setups (2x, 4x, 8x) are not linearly efficient. Communication overhead between GPUs means 2x GPUs give you roughly 1.7-1.8x the throughput. If your model fits on a single GPU, using one GPU is always more cost-efficient. Only reach for multi-GPU when your model genuinely doesn't fit in the VRAM of a single card, or when your batch size is so large that you need to split it across GPUs for memory reasons.
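The penalty is easiest to see as cost per unit of throughput rather than raw throughput; the 1.75x scaling figure below is the midpoint of the range above, assumed for illustration.

```python
def cost_per_unit_throughput(n_gpus: int, rate_per_hr: float,
                             scaling_per_gpu: float = 0.875) -> float:
    """$/hr per 1-GPU-equivalent of work; scaling captures comms overhead."""
    effective = n_gpus * scaling_per_gpu if n_gpus > 1 else 1.0
    return n_gpus * rate_per_hr / effective

print(f"1x H100: ${cost_per_unit_throughput(1, 1.87):.2f}/hr per unit of work")
print(f"2x H100: ${cost_per_unit_throughput(2, 1.87):.2f}/hr per unit of work")
# 2x GPUs at ~1.75x throughput cost ~14% more per unit of training done
```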
9. Negotiate Volume Discounts
If you're spending more than $2,000/month with a single provider, contact their sales team. Most providers offer 10-30% discounts for committed usage. Lambda offers reserved pricing, CoreWeave has committed-use contracts, and even marketplace providers will negotiate for large accounts. A 20% discount on a $5,000/month bill saves $12,000/year.
10. Monitor and Alert on Idle GPUs
Set up monitoring that tracks GPU utilization and alerts you when a GPU has been idle for more than 30 minutes. NVIDIA's nvidia-smi can be polled via cron, and most providers have APIs to check instance status. A simple Slack webhook that pings you when GPU utilization drops below 5% for 30 minutes can save hundreds of dollars a month by catching forgotten instances.
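A minimal version of that alert, again meant for cron; the webhook URL is a placeholder, and the 30-minute window would be tracked with a state file exactly as in the shutdown watchdog sketch earlier.

```python
#!/usr/bin/env python3
# Cron this every 10 minutes alongside (or instead of) auto-shutdown.
import subprocess

import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def gpu_utilization() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return max(int(line) for line in out.strip().splitlines())

util = gpu_utilization()
if util < 5:  # add a 30-minute state file, as in the watchdog, to cut noise
    requests.post(WEBHOOK_URL, json={
        "text": f":warning: GPU utilization at {util}%. Forgotten instance?"})
```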
The Bottom Line
Your real GPU cloud bill is your hourly rate divided by your utilization rate, plus storage, plus egress, plus any stopped-instance fees. For most teams, this means the advertised price is 40-60% of what you actually pay. The good news is that most of the hidden costs are operational, not structural — you can eliminate them with better practices rather than switching providers. Script your shutdowns, use spot instances for training, store data on object storage, right-size your GPUs, and monitor for idle instances.
Use our GPU price comparison tool with the TMC panel to calculate your true monthly cost before committing to a provider. Knowing the real number before you start is worth more than any optimization after you've already been overpaying for months.