
How to Run Llama 4 Locally (Scout + Maverick)

Step-by-step guide to running Llama 4 Scout and Maverick locally with Ollama. VRAM requirements, benchmarks, and API setup.

April 10, 2026 · 8 min read
Llama 4 Local Requirements at a Glance

- Scout (17B active): 12 GB VRAM
- Maverick (400B total): 80 GB VRAM
- Scout (Q4): 8 GB VRAM
- Recommended GPU: RTX 4090 / H100

Meta released Llama 4 with two open-weight variants: Scout (17B active parameters, 16 experts MoE) and Maverick (400B total, 17B active per token). Scout runs comfortably on a single RTX 4090 at 4-bit quantization. Maverick needs a multi-GPU rig or a cloud instance.

This guide walks through running both models locally with Ollama — the easiest path for most developers. If your GPU doesn't have enough VRAM, we also cover CPU offload and cloud alternatives.

Requirements

| Component | Scout (min) | Maverick (min) |
|---|---|---|
| VRAM | 8 GB (Q4) | 80 GB (Q4, multi-GPU) |
| RAM | 16 GB | 64 GB |
| GPU | RTX 3080 / 4070 | 2× A100 80GB or H100 |
| Storage | 10 GB free | 120 GB free |
| OS | macOS / Linux / Windows | Linux recommended |

Step 1: Install Ollama

Ollama is a single-binary tool that handles model downloads, quantization selection, and serving. Install it with one command:

```shell
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from https://ollama.com/download

# Verify the installation
ollama --version
```
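Once installed, the Ollama server listens on `localhost:11434`. A quick way to confirm it is reachable from Python before wiring up any tooling — a minimal sketch using only the standard library (the helper name `ollama_is_up` is ours, not part of any SDK):

```python
import urllib.request
import urllib.error


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if a local Ollama server answers on its root endpoint."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            # Ollama's root endpoint replies with a simple "Ollama is running" page
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused / timeout: no server running
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```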

Step 2: Pull and Run Llama 4 Scout

Scout is the practical choice for most local setups. The default pull uses Q4_K_M quantization (~9 GB):

```shell
# Pull and run Scout (interactive chat)
ollama run llama4:scout

# Or pull first, run later
ollama pull llama4:scout
ollama run llama4:scout "Explain transformers in simple terms"

# For higher quality (needs 12 GB VRAM):
ollama run llama4:scout-q8
```
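Before pulling, it helps to sanity-check whether a quantization will fit your card. As a rough rule of thumb (our approximation, not an official Ollama figure), the weight footprint is parameter count × bits per weight ÷ 8, plus a small fixed allowance for the KV cache and runtime; using the article's 17B-active figure this lands near the ~9 GB Q4 size quoted above:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Back-of-the-envelope VRAM estimate: quantized weights + fixed overhead.

    params_billion: billions of parameters that must be resident.
    bits_per_weight: e.g. 4 for Q4, 8 for Q8.
    overhead_gb: rough allowance for KV cache and runtime buffers (assumed).
    """
    return params_billion * bits_per_weight / 8 + overhead_gb


# Scout at Q4 vs Q8 (17B active parameters, per the article)
print(f"Q4: ~{estimate_vram_gb(17, 4):.1f} GB")
print(f"Q8: ~{estimate_vram_gb(17, 8):.1f} GB")
```

This is only a sizing heuristic; actual usage varies with context length and the specific quantization mix (Q4_K_M averages slightly above 4 bits per weight).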

Step 3: Run Llama 4 Maverick (Multi-GPU)

Maverick requires multi-GPU or a cloud instance. Ollama automatically distributes layers across available GPUs:

```shell
# Ensure CUDA is visible for multi-GPU
export CUDA_VISIBLE_DEVICES=0,1

# Pull Maverick (Q4, ~100 GB download)
ollama pull llama4:maverick

# Run with increased context window
OLLAMA_NUM_CTX=32768 ollama run llama4:maverick
```
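Ollama handles the split itself, but the idea behind "distributing layers across GPUs" is simple: assign each GPU a share of the transformer layers proportional to its free VRAM. An illustrative sketch of that allocation (our toy version, not Ollama's actual scheduler):

```python
def split_layers(n_layers: int, gpu_vram_gb: list[float]) -> list[int]:
    """Assign n_layers across GPUs proportionally to VRAM (largest-remainder rounding)."""
    total = sum(gpu_vram_gb)
    shares = [n_layers * v / total for v in gpu_vram_gb]
    counts = [int(s) for s in shares]
    # Hand out leftover layers to the GPUs with the largest fractional remainder
    leftovers = n_layers - sum(counts)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts


# Two equal A100s split evenly; a mismatched pair splits proportionally
print(split_layers(48, [80.0, 80.0]))  # [24, 24]
print(split_layers(48, [80.0, 40.0]))  # [32, 16]
```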

Step 4: Use the REST API

Ollama serves its native REST API on port 11434 and an OpenAI-compatible endpoint under `/v1`, so you can use it with any OpenAI SDK:

```shell
# Direct call to the native endpoint
curl http://localhost:11434/api/generate \
  -d '{"model":"llama4:scout","prompt":"Hello!"}'
```

```python
# OpenAI SDK (Python)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
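By default the native `/api/generate` endpoint streams its reply as newline-delimited JSON, one chunk per token batch, each carrying a `response` fragment and a `done` flag. Reassembling a streamed reply looks like this (the sample chunks below are fabricated for illustration; only the field names follow Ollama's streaming format):

```python
import json
from typing import Iterable


def collect_stream(ndjson_lines: Iterable[str]) -> str:
    """Concatenate the 'response' fragments from an Ollama streaming reply."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Chunks shaped like Ollama's streaming output (illustrative, not a real reply)
sample = [
    '{"model":"llama4:scout","response":"Hello","done":false}',
    '{"model":"llama4:scout","response":", world!","done":true}',
]
print(collect_stream(sample))  # Hello, world!
```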

Performance Benchmarks

| GPU | Scout tokens/s | Cloud cost |
|---|---|---|
| RTX 4090 (24 GB) | ~45 t/s | $0.74/hr on RunPod |
| RTX 3080 (10 GB) | ~22 t/s (Q4 only) | $0.30/hr |
| A100 80GB | ~90 t/s | $1.89/hr on Lambda |
| H100 SXM | ~140 t/s | $2.49/hr on RunPod |
| M3 Max (48 GB unified) | ~35 t/s | local only |

No GPU? Use CPU Offload or Cloud

If you don't have a capable GPU, Ollama can run Scout on CPU with RAM offload — expect 2–4 tokens/s. For Maverick on CPU, performance is impractical. The better option is a cloud GPU at $0.74–$2.49/hr for on-demand inference.

```shell
# Force CPU only (no GPU)
OLLAMA_NUM_GPU=0 ollama run llama4:scout

# Limit layers on GPU (partial offload)
OLLAMA_NUM_GPU=20 ollama run llama4:scout
```
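Why partial offload is so much slower than full GPU residency: each token must pass through every layer in sequence, so per-token latency is roughly the sum of per-layer times, and even a minority of CPU-resident layers dominates the total. A toy model with illustrative (not measured) per-layer timings makes the effect concrete:

```python
def tokens_per_sec(n_layers: int, gpu_layers: int,
                   gpu_layer_ms: float, cpu_layer_ms: float) -> float:
    """Estimate decode speed when gpu_layers run on GPU and the rest on CPU.

    Assumes layers execute sequentially per token (illustrative model only).
    """
    cpu_layers = n_layers - gpu_layers
    ms_per_token = gpu_layers * gpu_layer_ms + cpu_layers * cpu_layer_ms
    return 1000.0 / ms_per_token


# Hypothetical timings: 0.4 ms/layer on GPU vs 8 ms/layer on CPU, 48 layers
full = tokens_per_sec(48, 48, gpu_layer_ms=0.4, cpu_layer_ms=8.0)
partial = tokens_per_sec(48, 20, gpu_layer_ms=0.4, cpu_layer_ms=8.0)
print(f"all-GPU: ~{full:.0f} t/s, 20 layers on GPU: ~{partial:.1f} t/s")
```

With these assumed numbers, leaving 28 of 48 layers on CPU cuts throughput by an order of magnitude, which is consistent with the 2–4 tokens/s quoted above for CPU-heavy runs.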
