Choosing a GPU for fine-tuning is simultaneously the most important and most confusing decision in your ML pipeline. Get it wrong and you either blow your budget on hardware you don't need, or — worse — you spend three days watching a training run that keeps crashing with CUDA out-of-memory errors. The good news is that the decision is mostly mechanical once you understand VRAM requirements. This guide maps every common model size to the cheapest GPU that can handle it.
## VRAM Requirements: The Only Number That Matters (First)
VRAM is the primary constraint. Compute speed and cost matter too, but they're irrelevant if your model doesn't fit in memory. Here's the breakdown for the most common fine-tuning configurations, using real 2025 cloud prices:
| Model Size | Method | VRAM Needed | Best GPU | Price/hr |
|---|---|---|---|---|
| 7B | Full fine-tune | ~60 GB | A100 80GB | $0.34/hr (Vultr) |
| 7B | LoRA | ~20 GB | RTX 4090 (24GB) | $0.39/hr (CloudRift) |
| 13B | LoRA | ~30 GB | L40S (48GB) | $0.88/hr (Latitude) |
| 30B | LoRA | ~50 GB | A100 80GB | $0.34/hr (Vultr) |
| 70B | LoRA | ~80 GB | H200 (141GB) | $1.84/hr (Vast.ai) |
| 70B | Full fine-tune | ~300 GB | 4x H100 | $7.48/hr |
Notice the pattern: VRAM requirements jump dramatically when you move from LoRA to full fine-tuning. Full fine-tuning stores gradients plus optimizer states (Adam keeps two extra values, momentum and variance, for every parameter), so a 7B model that needs 14 GB of VRAM for inference suddenly demands roughly 60 GB for training. LoRA sidesteps this by training only a small adapter, typically 1–5% of the parameters, so the gradient and optimizer overhead is negligible. This is why LoRA has become the default for most people. You can check current prices on our comparison tool to find the latest rates.
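That accounting can be sketched as a back-of-envelope calculator. The byte counts here are assumptions (FP16/BF16 weights, gradients, and Adam states; FP32 optimizer states would double the Adam term), and the ~2% trainable fraction for LoRA is illustrative:

```python
def train_vram_gb(params_b, method="full", trainable_fraction=0.02):
    """Rough training VRAM in GB: weights + gradients + Adam states.

    Assumes 2 bytes (FP16/BF16) for weights, gradients, and both Adam
    moments. Activations come on top, so treat this as a floor.
    params_b: model size in billions of parameters.
    """
    weights = params_b * 2                     # base model weights
    frac = 1.0 if method == "full" else trainable_fraction
    grads = params_b * frac * 2                # gradients for trained params only
    adam = params_b * frac * 2 * 2             # momentum + variance per trained param
    return weights + grads + adam

print(train_vram_gb(7, "full"))   # 56.0 -> in line with the table's ~60 GB
print(train_vram_gb(7, "lora"))   # ~14.8 -> the frozen base weights dominate
```

With LoRA, the gradient and optimizer terms shrink by ~50x, which is exactly why the table's 7B row drops from ~60 GB to ~20 GB (the gap is activations and framework overhead).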
## The Criminally Underrated Option: QLoRA + RTX 4090
Hot take: QLoRA on an RTX 4090 is the most cost-effective way to fine-tune a 7B model in 2025, and almost nobody talks about it. QLoRA quantizes the base model to 4-bit and trains LoRA adapters in FP16. A 7B model in 4-bit is roughly 4 GB. Add LoRA adapters and optimizer states and you're at about 10–12 GB. That leaves 12 GB of headroom on the 4090's 24 GB — enough for generous batch sizes. At $0.39/hr on-demand or $0.17/hr spot, you can fine-tune a 7B model for under $2. Try doing that on an H100 at $1.87/hr.
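The QLoRA arithmetic works out like this. The ~2% adapter fraction and the byte counts are illustrative assumptions (4-bit base, FP16 adapter, FP32 Adam moments):

```python
def qlora_vram_gb(params_b, adapter_fraction=0.02):
    """Rough QLoRA footprint in GB: 4-bit base + FP16 adapter trained with Adam.

    adapter_fraction (~2% of params) is an illustrative assumption.
    Activations and dequantization buffers add several GB on top.
    """
    base = params_b * 0.5                      # 4-bit weights: 0.5 bytes/param
    adapter = params_b * adapter_fraction * 2  # FP16 adapter weights
    grads = params_b * adapter_fraction * 2    # FP16 adapter gradients
    adam = params_b * adapter_fraction * 8     # two FP32 Adam moments

    return base + adapter + grads + adam

print(qlora_vram_gb(7))  # ~5.2 GB before activations
```

Add activation memory and dequantization buffers and you land in the 10–12 GB range cited above, comfortably inside a 24 GB card.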
## The Decision Hierarchy: VRAM → Speed → Cost
When choosing a GPU for fine-tuning, apply these filters in order:
- VRAM first. Calculate your model's VRAM requirement (see table above). Filter out every GPU that doesn't have enough. This is a hard constraint — there's no workaround besides quantization or multi-GPU setups.
- Compute speed second. Among the GPUs that have enough VRAM, compare their TFLOPS and memory bandwidth. An A100 80 GB will train noticeably faster than an RTX A6000 (48 GB) because its HBM2e bandwidth (~2 TB/s) dwarfs the A6000's GDDR6 (~768 GB/s).
- Cost third. Among GPUs with enough VRAM and adequate speed, pick the cheapest one. This is where our comparison tool shines — sort by price after filtering for your minimum VRAM.
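The three filters above reduce to a few lines of code. The GPU specs and prices below are illustrative (prices from the table above, TFLOPS figures are rough half-precision numbers):

```python
def pick_gpu(gpus, min_vram_gb, min_tflops=0):
    """Apply the hierarchy: hard-filter on VRAM and speed, then sort by price."""
    viable = [g for g in gpus
              if g["vram_gb"] >= min_vram_gb and g["tflops"] >= min_tflops]
    return sorted(viable, key=lambda g: g["price_hr"])

# Illustrative catalog; check a live comparison tool for real numbers.
gpus = [
    {"name": "RTX 4090",  "vram_gb": 24, "tflops": 83,  "price_hr": 0.39},
    {"name": "A100 80GB", "vram_gb": 80, "tflops": 312, "price_hr": 0.34},
    {"name": "L40S",      "vram_gb": 48, "tflops": 91,  "price_hr": 0.88},
]

# A 30B LoRA run (~50 GB): only the A100 survives the VRAM filter.
print(pick_gpu(gpus, min_vram_gb=50)[0]["name"])  # A100 80GB
```

Note that the VRAM filter runs before any price comparison: a cheap GPU that can't hold the model is not a candidate at all.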
## Batch Size: The Hidden VRAM Multiplier
The VRAM numbers above assume modest batch sizes (4–8). If you want to run batch size 32 for faster convergence, your activation memory balloons. As a rough rule, doubling batch size adds 20–40% more VRAM depending on sequence length. This is why the "~20 GB for 7B LoRA" can quickly become 30+ GB if you're aggressive with batch size and sequence length. When in doubt, start with the smallest batch size that trains stably and increase only if you have VRAM headroom.
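You can see why the growth is sublinear with a toy model: total VRAM is a fixed term (weights and optimizer) plus a per-sample activation term, so doubling the batch only doubles the smaller term. The 15 GB fixed cost and 0.6 GB per sample below are illustrative assumptions for a 7B LoRA run:

```python
def training_vram_gb(fixed_gb, act_gb_per_sample, batch_size):
    """Total VRAM = fixed (weights + optimizer) + per-sample activations.

    Doubling batch_size only doubles the activation term, which is why
    the overall increase is a fraction of 2x, not 2x.
    """
    return fixed_gb + act_gb_per_sample * batch_size

# Illustrative 7B LoRA run: ~15 GB fixed, ~0.6 GB of activations per sample.
for bs in (4, 8, 16, 32):
    print(bs, round(training_vram_gb(15, 0.6, bs), 1))
```

With these numbers, going from batch 4 to batch 32 takes you from ~17 GB to ~34 GB, which is exactly how "~20 GB for 7B LoRA" quietly outgrows a 24 GB card.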
## Multi-GPU Training: Worth It?
The tempting thought is: "I can't afford one H100, so I'll rent two RTX 4090s and use data parallelism." In practice, multi-GPU training introduces communication overhead (gradient synchronization across PCIe or NVLink) that eats into your throughput. For 7B models, a single A100 80 GB is faster and cheaper than 2x RTX 4090s because you eliminate cross-device communication entirely. Multi-GPU only makes sense when your model physically cannot fit on a single GPU: in practice, full fine-tuning somewhere above 30B parameters, or LoRA beyond 70B (per the table above, a 70B LoRA still fits on a single 141 GB H200).
If you do need multi-GPU, prioritize instances with NVLink or NVSwitch interconnects. PCIe-connected multi-GPU setups have 5–10x less inter-GPU bandwidth, and you'll feel it in training throughput. Check the instance specs carefully — not all "4x H100" offerings use NVLink.
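A toy scaling model shows how much the interconnect matters. This is a rough sketch: the all-reduce is approximated as moving ~2x the gradient size, and the step time, gradient size, and bandwidth figures are illustrative assumptions:

```python
def effective_speedup(n_gpus, step_compute_s, grad_gb, bandwidth_gbps):
    """Toy data-parallel model: each step pays an all-reduce over the link.

    comm cost ~ 2 * gradient bytes / bandwidth (rough ring all-reduce).
    Returns the speedup vs. one GPU; ideal would be n_gpus.
    """
    comm_s = 2 * grad_gb / bandwidth_gbps
    return n_gpus * step_compute_s / (step_compute_s + comm_s)

# Full fine-tune of a 7B model syncs ~14 GB of FP16 gradients per step.
# Assumed link speeds: NVLink ~450 GB/s, PCIe 4.0 x16 ~32 GB/s.
print(effective_speedup(2, 1.0, 14, 450))  # ~1.88x of a single GPU
print(effective_speedup(2, 1.0, 14, 32))   # ~1.07x -- barely worth a 2nd GPU
```

On the PCIe numbers, the second GPU buys you almost nothing, which is the "you'll feel it" from above made concrete. (LoRA gradients are tiny, so LoRA suffers far less from slow links.)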
## Practical Advice for Getting Started
Here's the honest playbook:
- Start with the cheapest GPU that fits your model plus data in VRAM. Use QLoRA if it gets you onto a cheaper GPU tier.
- Run a short training experiment (100–500 steps) to measure throughput and estimate total training time.
- If the estimate is under 48 hours, you're done. Ship it.
- If the estimate is over 48 hours, scale up to a faster GPU at the same VRAM tier, or switch to a multi-GPU setup if you're VRAM-bound.
- Never start with the biggest GPU. You'll overpay for 90% of fine-tuning jobs.
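Step 2 of the playbook is a one-liner worth automating. The pilot duration, total step count, and price below are illustrative:

```python
def estimate_run(total_steps, pilot_steps, pilot_seconds, price_hr):
    """Extrapolate full-run wall time (hours) and cost from a short pilot."""
    sec_per_step = pilot_seconds / pilot_steps
    hours = total_steps * sec_per_step / 3600
    return hours, hours * price_hr

# Illustrative: a 200-step pilot took 150 s on a $0.39/hr RTX 4090.
hours, cost = estimate_run(20_000, 200, 150, 0.39)
print(f"{hours:.1f} h, ${cost:.2f}")  # 4.2 h, $1.62
```

If that estimate lands under 48 hours, ship it; if not, that's your cue to move up a speed tier at the same VRAM level.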
The fine-tuning GPU market in 2025 is extraordinarily competitive. An A100 80 GB at $0.34/hr on Vultr would have been unthinkable two years ago. Take advantage of the price war. Check the trends page to see how prices have been falling, and lock in spot instances when you can. Your fine-tuning bill should be measured in single-digit dollars for 7B models, not hundreds.