
The L40S Is the Most Underrated GPU in the Cloud

48GB VRAM, Ada Lovelace architecture, from $0.26/hr spot. The L40S handles 13B inference and fine-tuning at a fraction of A100 prices.

February 1, 2025 · 8 min read

Here's a controversial take: the NVIDIA L40S is the best value GPU you can rent in the cloud right now, and almost nobody is talking about it. The AI discourse is dominated by H100 hype and A100 nostalgia, while the L40S quietly sits there with 48GB of VRAM, Ada Lovelace architecture, and spot prices starting at $0.26/hr. It's the GPU equivalent of a Toyota Camry — not flashy, not the fastest on the track, but it handles 80% of real workloads at a price that makes financial sense. If you're running inference, fine-tuning models in the 7B-13B range, or doing any kind of image/video generation, you should seriously consider the L40S before reaching for an A100 or H100.

L40S Specifications: What You're Getting

The L40S is NVIDIA's data center GPU built on the Ada Lovelace architecture (the same generation as the RTX 4090). It's designed primarily for inference and graphics workloads, but its specs make it surprisingly capable for training too.

| Spec | L40S | A100 80GB | RTX 4090 | H100 80GB |
|---|---|---|---|---|
| VRAM | 48GB GDDR6 | 80GB HBM2e | 24GB GDDR6X | 80GB HBM3 |
| Memory Bandwidth | 864 GB/s | 2,039 GB/s | 1,008 GB/s | 3,350 GB/s |
| FP16 TFLOPS | 362 | 312 | 330 | 990 |
| FP8 TFLOPS | 724 | N/A | 660 | 1,979 |
| Architecture | Ada Lovelace | Ampere | Ada Lovelace | Hopper |
| Form Factor | PCIe | SXM / PCIe | PCIe | SXM / PCIe |
| TDP | 350W | 300-400W | 450W | 700W |

A few things jump out immediately. The L40S beats the A100 on FP16 compute (362 vs 312 TFLOPS), is a full generation newer architecturally, and costs significantly less. It supports FP8 at 724 TFLOPS, a feature the A100 doesn't have at all. And at 48GB of VRAM, it sits in a sweet spot between the RTX 4090's 24GB (too small for many production workloads) and the A100's 80GB (overkill for most inference).
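
One practical aside: when a provider hands you an instance, compute capability is the quickest way to confirm you actually got an Ada-generation card. A minimal check, assuming PyTorch with CUDA is installed:

```python
import torch

# Sanity check after an instance boots: confirm what the provider handed
# you. Ada Lovelace cards like the L40S report compute capability 8.9;
# FP8 tensor cores arrived with Ada (8.9) and Hopper (9.0).
props = torch.cuda.get_device_properties(0)
print(f"GPU:  {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")
print(f"CC:   {props.major}.{props.minor}")

if (props.major, props.minor) >= (8, 9):
    print("FP8 tensor cores available")
else:
    print("No FP8 -- Ampere (8.0) and older top out at BF16/TF32")
```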

What 48GB of VRAM Gets You

VRAM is the gating factor for most ML workloads. Here's exactly what fits in the L40S's 48GB and what doesn't; the sketch after this list shows the arithmetic behind the estimates.

  • 13B model inference at FP16 (~26GB): Fits easily with 22GB of headroom for KV cache and large batch sizes. You can serve 10-15 concurrent requests with room to spare. This is the bread-and-butter use case for the L40S.
  • 30B model inference with 4-bit quantization (~20GB): Runs great. GPTQ or AWQ quantization compresses a 30B model to ~18-20GB, leaving ample room for the KV cache. Quality degradation at 4-bit for 30B+ models is minimal — typically under 2% on standard benchmarks.
  • 7B fine-tuning, full or with LoRA (~20-40GB): A 7B model in FP16 takes ~14GB for weights. Add gradients, activations, and optimizer states (Adam keeps two extra buffers per parameter), and a full fine-tune lands around 40GB even with memory-saving measures like gradient checkpointing. LoRA reduces this to ~20-25GB because you only maintain optimizer states for the small adapter weights. Both fit on an L40S, though a full fine-tune leaves little headroom.
  • Stable Diffusion XL with ControlNet + adapters (~18-25GB): The SDXL base model uses ~7GB in FP16. Add ControlNet (~2.5GB), IP-Adapter (~2GB), LoRA adapters, and the VAE decoder, and you're at 15-25GB depending on how many adapters are loaded. No VRAM pressure on the L40S, even with multiple ControlNet models loaded simultaneously.
  • 70B model inference with 4-bit quantization (~38-42GB): This is tight but possible. A 70B model at 4-bit quantization compresses to ~35-38GB, leaving only 10-13GB for the KV cache. With limited context windows and low concurrency, it works. For production serving, you'll want the A100 80GB for breathing room.
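
The arithmetic behind these estimates is simple enough to script. Here's a back-of-envelope sketch; the constants are approximations, and grouped-query-attention models cache less than this full-attention estimate assumes:

```python
# Back-of-envelope VRAM math for the estimates above. Assumes FP16 and
# full multi-head attention (GQA models cache less); real frameworks add
# a few GB for CUDA context, activations, and allocator fragmentation.
GiB = 1024**3

def weight_gib(params_billions: float, bits: int) -> float:
    """Weight memory: parameter count times bytes per parameter."""
    return params_billions * 1e9 * bits / 8 / GiB

def kv_gib_per_token(layers: int, hidden: int, bits: int = 16) -> float:
    """KV cache per token: 2 tensors (K and V) x layers x hidden size."""
    return 2 * layers * hidden * bits / 8 / GiB

# Llama-2-13B-ish shape: 40 layers, hidden size 5120
weights = weight_gib(13, 16)            # ~24 GiB
kv_tok = kv_gib_per_token(40, 5120)     # ~0.0008 GiB per token
budget = 48 - weights - 2               # reserve ~2 GiB for overhead

print(f"weights {weights:.1f} GiB, KV budget {budget:.1f} GiB")
print(f"~{budget / kv_tok:,.0f} cached tokens, "
      f"~{budget / kv_tok / 2048:.0f} concurrent 2k-context requests")
```

Plug in params_billions=70 and bits=4 and you get the tight 70B picture from the last bullet: roughly 33GiB of weights and only ~13GiB of KV budget.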

L40S vs A100 80GB: The Value Argument

The A100 80GB has been the default recommendation for serious ML work since 2021. It has two clear advantages over the L40S: 80GB vs 48GB of VRAM, and HBM2e at roughly 2 TB/s vs GDDR6 at 864 GB/s of memory bandwidth. The extra VRAM matters for 70B+ models and large-batch training. The higher memory bandwidth matters for inference, where token generation speed is memory-bandwidth-bound.

But the L40S has the newer architecture with FP8 support (724 TFLOPS vs none on the A100), more FP16 compute (362 vs 312 TFLOPS), and a dramatically lower price. A100 80GB on-demand pricing starts at $0.34/hr at the very cheapest providers but runs $1.10-3.67/hr at most of them; the L40S starts at $0.88/hr on-demand and $0.26/hr spot. For any workload that fits in 48GB, which covers most inference and fine-tuning, the L40S delivers comparable or better compute performance at a lower price. The A100 only wins when you need the extra VRAM or the higher memory bandwidth for large-model inference throughput.
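
Peak TFLOPS per dollar is a crude yardstick (it ignores memory-bandwidth limits entirely), but it makes the value argument concrete. A quick sketch using the spec table and the cheapest prices quoted in this post:

```python
# Crude value metric: peak FP16 tensor TFLOPS per dollar-hour, using the
# spec table above and the cheapest prices quoted in this post. Peak
# compute ignores bandwidth limits, so treat this as a first-pass filter.
gpus = {
    # name:       (FP16 TFLOPS, cheapest $/hr quoted)
    "L40S":       (362, 0.26),  # spot
    "A100 80GB":  (312, 0.34),  # low-end on-demand
    "RTX 4090":   (330, 0.34),  # RunPod on-demand
    "H100 80GB":  (990, 0.73),  # spot
}

for name, (tflops, price) in gpus.items():
    print(f"{name:10s} {tflops / price:7.0f} peak TFLOPS per $/hr")
```

By this yardstick the L40S even edges out the H100 (roughly 1,390 vs 1,360 peak TFLOPS per dollar-hour), and the A100's case rests entirely on its extra VRAM and bandwidth.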

L40S vs RTX 4090: Double the VRAM

The L40S and RTX 4090 share the same Ada Lovelace architecture, so their raw compute performance is similar. The RTX 4090 actually has slightly higher memory bandwidth (1,008 GB/s vs 864 GB/s) because it uses faster GDDR6X; the L40S pairs plain GDDR6 with the same 384-bit bus. But the L40S has double the VRAM: 48GB vs 24GB. That doubles the size of the models you can serve, the batch sizes you can run, and the flexibility of your VRAM allocation. The RTX 4090 starts at $0.34/hr on RunPod, while the L40S starts at $0.88/hr on-demand. You pay roughly 2.5x more for 2x the VRAM, a reasonable trade for workloads that need it. If your model fits in 24GB with room for the KV cache, stick with the 4090. If you need more VRAM, the L40S is the cheapest path to 48GB in the cloud.

L40S vs H100: The Price-Performance Reality

The H100 is unquestionably the faster GPU. It has 2.7x the FP16 TFLOPS (990 vs 362), 3.9x the memory bandwidth (3,350 vs 864 GB/s), and 80GB of HBM3 against 48GB of GDDR6. For training throughput, especially with large batch sizes and FP8, the H100 is in a different league. But at $0.26/hr spot, the L40S is 2.8x cheaper than the H100's $0.73/hr spot price, and the H100 is not 2.8x faster for most real-world workloads. For inference of models that fit in 48GB, the L40S delivers 50-70% of the H100's throughput at 35% of the price. The math clearly favors the L40S for cost-sensitive deployments. You'd need to be genuinely compute-bound (large-scale training, high-throughput inference serving) to justify the H100 premium.
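
Here's what that throughput-versus-price claim means per token. The H100 tokens/sec figure below is purely illustrative, not a benchmark; plug in your own numbers:

```python
# Worked example for the throughput-vs-price claim above. The H100
# tokens/sec figure is illustrative only; substitute real benchmarks.
H100_PRICE, L40S_PRICE = 0.73, 0.26    # $/hr spot, from this post
h100_tps = 2500                        # hypothetical 13B serving rate

h100_cost = H100_PRICE / (h100_tps * 3600) * 1e6
print(f"H100: ${h100_cost:.3f} per million tokens")

for frac in (0.5, 0.7):
    l40s_cost = L40S_PRICE / (h100_tps * frac * 3600) * 1e6
    print(f"L40S at {frac:.0%} of H100 speed: ${l40s_cost:.3f} per million tokens")
```

Even at the pessimistic 50% end of the range, the L40S produces tokens roughly 30% cheaper than the H100.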

Best Use Cases for the L40S

  • Inference servers for 7B-13B models: The sweet spot. 48GB handles these models in FP16 with plenty of room for batching. Cost-efficient and performant; a minimal serving sketch follows this list.
  • Fine-tuning 7B-13B models: LoRA and QLoRA fine-tuning fits comfortably. Full fine-tuning of 7B models is possible with careful memory management.
  • Image and video generation: SDXL, Flux, and video diffusion models run well. The 48GB VRAM handles complex multi-adapter pipelines without VRAM pressure.
  • Research and experimentation: At $0.26/hr spot, you can experiment all day for under $3. This makes iteration cheap enough that you can try more ideas.
  • Quantized 30B-70B inference: With 4-bit quantization, models up to 70B fit in 48GB (though 70B is tight). Great for cost-sensitive deployments where maximum quality isn't required.
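
For the serving sketch promised in the first bullet, here's roughly what a 13B deployment looks like, assuming vLLM; the model name and settings are examples, not requirements (gated models also need a Hugging Face token):

```python
# Minimal inference sketch for the 7B-13B sweet spot, assuming vLLM
# (pip install vllm). Swap in whatever model you're actually serving.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # ~26GB of FP16 weights
    dtype="float16",
    gpu_memory_utilization=0.90,  # leave headroom for CUDA overhead
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The remaining VRAM after the weights becomes KV cache, which is what lets a single L40S hold 10-15 concurrent 2k-context requests.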

Worst Use Cases for the L40S

  • 70B+ models at full precision: A 70B model in FP16 needs ~140GB of VRAM. Even quantized, you're pushing the 48GB limit with no room for KV cache at high concurrency. Use an A100 80GB, H100, or multi-GPU setup.
  • Large-scale distributed training: The L40S is PCIe-only with no NVLink support. Multi-GPU communication bottlenecks on the PCIe bus. For multi-node training, use H100 SXM with NVLink.
  • Workloads that need HBM bandwidth: The L40S's 864 GB/s of GDDR6 bandwidth is well under half the A100's 2,039 GB/s HBM2e and barely a quarter of the H100's 3,350 GB/s HBM3. For bandwidth-sensitive workloads like large-batch inference, this matters.
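
That last point is easy to quantify. At batch size 1, every generated token streams the full weights through memory, so bandwidth divided by weight bytes gives a crude ceiling on single-stream tokens/sec:

```python
# Crude decode-speed ceiling at batch size 1: bandwidth / weight bytes
# bounds tokens/sec. Ignores KV cache reads, kernel overhead, and overlap.
def decode_ceiling_tps(bandwidth_gbs: float, params_b: float, bits: int = 16) -> float:
    return bandwidth_gbs / (params_b * bits / 8)

for name, bw in [("L40S", 864), ("A100 80GB", 2039), ("H100 80GB", 3350)]:
    print(f"{name:10s} ~{decode_ceiling_tps(bw, 13):4.0f} tok/s ceiling, 13B FP16, batch 1")
```

Batching amortizes those weight reads, which is why the L40S's concurrency headroom matters more than its single-stream speed for serving.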

Where to Rent an L40S

| Provider | On-Demand | Spot | Notes |
|---|---|---|---|
| RunPod | $0.94/hr | $0.26/hr | Best spot price, per-second billing |
| Latitude.sh | $0.88/hr | — | Best on-demand price, bare metal |
| Hyperstack | $0.95/hr | — | Dedicated instances, good availability |
| AWS | $2.94/hr | ~$0.88/hr | g6e instances, enterprise features |
| GCP | $2.68/hr | ~$0.80/hr | g2 instances, good integration |

My recommendation: If you need an L40S for development and experimentation, use RunPod spot at $0.26/hr. For production inference that needs to stay up 24/7, use Latitude.sh at $0.88/hr for the best combination of price and reliability. Only use AWS/GCP if you need the L40S integrated into your existing hyperscaler infrastructure.
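
For always-on production, it helps to see those hourly rates as a monthly bill (roughly 730 hours/month):

```python
# Monthly cost of a single always-on L40S at the rates in the table above.
HOURS_PER_MONTH = 730
rates = {
    "RunPod spot (interruptible)": 0.26,
    "Latitude.sh on-demand":       0.88,
    "AWS g6e on-demand":           2.94,
}
for name, rate in rates.items():
    print(f"{name:29s} ${rate * HOURS_PER_MONTH:8,.2f}/month")
```

The spot-versus-on-demand gap (about $190 vs $642 per month) is the price of interruption risk, which is why the recommendation above splits by workload.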

The Bottom Line

The L40S is the GPU that nobody talks about but everybody should be using. It occupies the perfect middle ground: enough VRAM for serious workloads (48GB), modern architecture with FP8 support, and prices that make experimentation and production serving financially viable. The A100 has more VRAM and bandwidth but costs more and lacks FP8. The RTX 4090 is cheaper but has half the VRAM. The H100 is faster but costs 2.8-7x more. For 80% of inference and fine-tuning workloads in the 7B-30B parameter range, the L40S is the right choice — and it's not even close on price-performance.

Check the live L40S pricing across all providers on our comparison tool.
