VRAM is the single most important spec when renting a GPU for AI workloads, and most people get it wrong in one of two directions: they either rent an 80 GB A100 to run Stable Diffusion (overpaying by 5x), or they try to squeeze a 13B model onto a 16 GB T4 and spend half their time debugging out-of-memory crashes. This guide maps every common AI workload to the minimum VRAM tier that can handle it, along with the cheapest GPU at that tier.
The VRAM Tier Breakdown
Think of VRAM in tiers, not individual numbers. Each tier unlocks a specific set of workloads, and within each tier the cheapest GPU is your target.
Tier 1: 8–16 GB (T4 $0.07/hr spot, RTX 3070 $0.07/hr spot)
The budget tier. At seven cents per hour, you can run Stable Diffusion 1.5 (~4 GB), small model inference (GPT-2, BERT, DistilBERT), embedding generation for RAG pipelines, and basic image classification. This is also where you'd run quantized 7B models with aggressive 4-bit GPTQ — they squeeze into 4–6 GB, though the KV cache limits your context window. If you're just learning ML or running lightweight production inference, this tier is shockingly cheap.
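To make the tier concrete, here is a minimal sketch of running Stable Diffusion 1.5 in FP16 on a 16 GB T4 using the diffusers library; the model ID, prompt, and attention-slicing setting are illustrative choices, not a tuned recipe.

```python
# Minimal sketch: Stable Diffusion 1.5 in FP16 on a Tier 1 card (e.g. a 16 GB T4).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # ~4 GB of weights in FP16
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()  # trades a little speed for lower peak VRAM

image = pipe("a watercolor painting of a data center at dusk").images[0]
image.save("out.png")
```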
Tier 2: 24 GB (RTX 4090 $0.17/hr spot, L4 $0.88/hr on-demand)
The sweet spot for individual developers. 24 GB is enough for 7B FP16 inference (~14 GB), LoRA fine-tuning of 7B models (~20 GB), Stable Diffusion XL (~7 GB), and 13B models with 4-bit quantization (~8 GB plus KV cache). The RTX 4090 at $0.17/hr spot on Vast.ai is the king of this tier. If your workload fits in 24 GB, you probably shouldn't be spending more. The L4 at $0.88/hr is the "safe" option if you need on-demand reliability without spot interruptions.
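To give a sense of the LoRA workflow that fits in 24 GB, here is a rough sketch using Hugging Face transformers and peft; the model name, rank, and target modules are placeholder choices for illustration rather than recommendations.

```python
# Rough sketch: attaching LoRA adapters to a 7B model loaded in FP16 (fits Tier 2).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # ~14 GB of FP16 weights
    torch_dtype=torch.float16,
    device_map="auto",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()        # typically well under 1% of the 7B params
```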
Tier 3: 48 GB (L40S $0.88/hr, A6000 ~$0.50/hr)
The awkward middle child that's actually quite useful. 48 GB handles 13B FP16 inference (~26 GB), 30B quantized inference with GPTQ 4-bit (~18 GB plus KV cache), and multi-model serving (two 7B models side by side). Full fine-tuning of 7B models (~60 GB) does not fit; use LoRA at this tier or step up to 80 GB. The L40S on Latitude at $0.88/hr is strong here. The A6000 is slightly cheaper but slower on tensor operations.
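When serving two models side by side at this tier, it pays to check free VRAM before loading the second one. A small helper along these lines (assuming PyTorch with CUDA) does the job; the 20 GB threshold is an illustrative figure for a second 7B FP16 model plus its KV cache.

```python
# Sanity-check free VRAM before loading another model onto a 48 GB card.
import torch

def free_vram_gb(device: int = 0) -> float:
    """Return free VRAM on the given GPU in GiB."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 1024**3

if free_vram_gb() < 20:  # ~14 GB FP16 weights + KV cache headroom for a second 7B model
    raise RuntimeError("Not enough headroom for another 7B model on this GPU")
```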
Tier 4: 80 GB (A100 80GB $0.34/hr, H100 $1.87/hr)
The workhorse tier for serious ML work. 80 GB handles 30B FP16 inference (~60 GB), 70B quantized inference with GPTQ 4-bit (~38 GB), full fine-tuning of 7B models (~60 GB), LoRA fine-tuning of 13B–30B models, and batch inference serving multiple concurrent users. The A100 80 GB at $0.34/hr on Vultr is absurdly good value. The H100 is faster but costs 5.5x more — only worth it when you need the throughput.
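For the batch-inference use case, a serving engine such as vLLM puts 80 GB to good use by packing many concurrent requests into the KV cache. A hedged sketch, with an illustrative ~30B checkpoint standing in for whatever model you actually serve:

```python
# Sketch: batched generation on an 80 GB card with vLLM.
# A ~30B FP16 model (~60 GB of weights) leaves room for the KV cache.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mosaicml/mpt-30b",        # illustrative ~30B FP16 checkpoint
    dtype="float16",
    gpu_memory_utilization=0.90,     # leave a small safety margin
)

prompts = [f"Summarize ticket #{i} in one sentence:" for i in range(64)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128, temperature=0.7))
for out in outputs:
    print(out.outputs[0].text)
```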
Tier 5: 96–141 GB (GH200 $1.99/hr, H200 $1.84/hr)
The "run 70B in a single GPU" tier. The H200 with 141 GB of HBM3e can fit a 70B FP16 model (140 GB) in a single card — no tensor parallelism, no multi-GPU communication overhead. This is game-changing for 70B inference latency. The GH200 with 96 GB can handle 70B in 4-bit quantization (~38 GB) with generous KV cache for long-context tasks, or 30B fine-tuning. At $1.84/hr on Vast.ai, the H200 is actually cheaper than the H100 while being strictly better.
Tier 6: 180 GB+ (B200 180GB, MI300X 192GB)
The frontier tier. These cards are for 70B full fine-tuning (which needs ~300 GB, so you still need multi-GPU), 100B+ model inference, and massive batch inference workloads. The MI300X with 192 GB of HBM3 is AMD's play for the datacenter market. Availability is still limited, but if you can get one, the VRAM-per-dollar ratio is competitive. This tier is for teams, not individuals.
The Rule of Thumb
Quick math: model parameters × 2 bytes (FP16) = minimum VRAM for inference. Add 50% for fine-tuning overhead (optimizer states, activations, gradients). So a 7B model needs ~14 GB for inference and ~21 GB for LoRA fine-tuning. A 70B model needs ~140 GB for inference and ~210 GB for LoRA fine-tuning. These are minimums — add headroom for KV cache and batch size.
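The same rule of thumb as a quick calculator; the 2 bytes per parameter and the 50% overhead are the heuristics above, not measurements.

```python
# The article's rule of thumb: FP16 weights = 2 bytes/param, +50% for fine-tuning overhead.
def vram_estimate_gb(params_billions: float, finetune: bool = False) -> float:
    inference_gb = params_billions * 2          # FP16: 2 bytes per parameter
    return inference_gb * 1.5 if finetune else inference_gb

for size in (7, 13, 30, 70):
    print(f"{size}B: ~{vram_estimate_gb(size):.0f} GB inference, "
          f"~{vram_estimate_gb(size, finetune=True):.0f} GB LoRA fine-tune")
```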
Workload → GPU Quick Reference
| Workload | Min VRAM | Cheapest GPU | Price |
|---|---|---|---|
| Stable Diffusion | 8 GB | T4 (spot) | $0.07/hr |
| 7B inference (FP16) | 16 GB | RTX 4090 (spot) | $0.17/hr |
| 7B LoRA fine-tune | 20 GB | RTX 4090 | $0.39/hr |
| 13B inference (FP16) | 28 GB | A6000 (48GB) | ~$0.50/hr |
| 30B inference (quantized) | 20 GB | RTX 4090 | $0.39/hr |
| 70B inference (quantized) | 40 GB | L40S (48GB) | $0.88/hr |
| 70B inference (FP16) | 140 GB | H200 (141GB) | $1.84/hr |
The Quantization Escape Hatch
Before you rent a bigger GPU, ask yourself: can I quantize? GPTQ and AWQ 4-bit quantization lets you run models at roughly 25% of the FP16 VRAM with minimal quality loss (typically 1–3% degradation on benchmarks). That means a 70B model that needs 140 GB in FP16 can squeeze into 35–40 GB in 4-bit — which fits on a single 48 GB L40S. You just saved yourself a jump from $0.88/hr to $1.84/hr.
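Here is what the escape hatch looks like in code. This sketch uses bitsandbytes NF4 as a stand-in for GPTQ/AWQ (same ~4 bits per weight budget); a prequantized GPTQ or AWQ checkpoint loads similarly through transformers. The model name is illustrative.

```python
# Sketch: loading a 70B model in 4-bit so it fits a 48 GB card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # ~140 GB in FP16, roughly 35-40 GB in 4-bit
    quantization_config=quant,
    device_map="auto",
)

inputs = tok("Explain KV cache growth in one paragraph.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=80)[0], skip_special_tokens=True))
```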
The quality tradeoff is real but often acceptable. For chatbot inference, creative writing, and code generation, 4-bit quantized models are practically indistinguishable from FP16. For tasks requiring precise numerical reasoning or factual recall, you might notice degradation. Test before committing.
The Controversial Take
80 GB is the new minimum for serious ML work. If you're doing anything beyond hobbyist inference — fine-tuning, multi-model serving, batch processing, working with models larger than 7B — you will constantly hit VRAM walls below 80 GB. The A100 80 GB at $0.34/hr on Vultr makes this tier more accessible than ever. Yes, you can survive on 24–48 GB with quantization tricks and careful memory management. But the engineering time you spend fighting VRAM limits is worth more than the $0.50/hr difference between GPU tiers. Check our comparison tool to see how affordable 80 GB has become.