Meta released Llama 4 with two open-weight variants: Scout (17B active parameters, a 16-expert MoE) and Maverick (400B total parameters, 17B active per token). Scout runs comfortably on a single RTX 4090 at 4-bit quantization. Maverick needs a multi-GPU rig or a cloud instance.
This guide walks through running both models locally with Ollama — the easiest path for most developers. If your GPU doesn't have enough VRAM, we also cover CPU offload and cloud alternatives.
## Requirements
| Component | Scout (min) | Maverick (min) |
|---|---|---|
| VRAM | 8 GB (Q4) | 80 GB (Q4, multi-GPU) |
| RAM | 16 GB | 64 GB |
| GPU | RTX 3080 / 4070 | 2× A100 80GB or H100 |
| Storage | 10 GB free | 120 GB free |
| OS | macOS / Linux / Windows | Linux recommended |
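As a rule of thumb, a quantized model's weight file takes roughly parameter count × bits per weight ÷ 8 bytes, plus a few GB of headroom for the KV cache and runtime. A minimal sketch of that arithmetic (the bits-per-weight figures are approximations for common GGUF quant types, not exact values):

```python
# Approximate bits per weight for common quantization formats (assumed values)
BITS_PER_WEIGHT = {"q4_k_m": 4.5, "q8_0": 8.5, "f16": 16.0}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Rough weight-file size in GB for a given parameter count and quant type."""
    bits = BITS_PER_WEIGHT[quant.lower()]
    return params_billion * 1e9 * bits / 8 / 1e9

# 17B parameters at ~4.5 bits/weight lands near the ~9 GB Scout download
print(f"{model_size_gb(17, 'q4_k_m'):.1f} GB")  # → 9.6 GB
```

Add 2–4 GB on top of this for context and overhead when sizing VRAM.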
## Step 1: Install Ollama
Ollama is a single-binary tool that handles model downloads, quantization selection, and serving. Install it with one command:
```shell
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from https://ollama.com/download

# Verify the installation
ollama --version
```

## Step 2: Pull and Run Llama 4 Scout
Scout is the practical choice for most local setups. The default pull uses Q4_K_M quantization (~9 GB):
```shell
# Pull and run Scout (interactive chat)
ollama run llama4:scout

# Or pull first, run later
ollama pull llama4:scout
ollama run llama4:scout "Explain transformers in simple terms"

# For higher quality (needs 12 GB VRAM):
ollama run llama4:scout-q8
```

## Step 3: Run Llama 4 Maverick (Multi-GPU)
Maverick requires multi-GPU or a cloud instance. Ollama automatically distributes layers across available GPUs:
```shell
# Make both GPUs visible to the Ollama server
export CUDA_VISIBLE_DEVICES=0,1

# Pull Maverick (Q4, ~100 GB download)
ollama pull llama4:maverick

# Run, then raise the context window from inside the chat session:
ollama run llama4:maverick
# >>> /set parameter num_ctx 32768
```

## Step 4: Use the REST API
Ollama serves its native REST API on port 11434 and also exposes an OpenAI-compatible endpoint at `/v1`, so you can use it with any OpenAI SDK:
```shell
# Direct API call (native endpoint; "stream": false returns one JSON object
# instead of a stream of newline-delimited chunks)
curl http://localhost:11434/api/generate \
  -d '{"model":"llama4:scout","prompt":"Hello!","stream":false}'
```
# OpenAI SDK (Python)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama4:scout",
messages=[{"role":"user","content":"Hello!"}]
)
print(response.choices[0].message.content)Performance Benchmarks
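Throughput figures like those in the table below can be computed from the native API response itself, which reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). A small helper, shown with a made-up sample response rather than a live call:

```python
def tokens_per_second(resp: dict) -> float:
    """Decode speed from an Ollama /api/generate (non-streaming) response."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Hypothetical response fields: 450 tokens generated in 10 seconds
sample = {"eval_count": 450, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.0f} t/s")  # → 45 t/s
```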
| GPU | Scout tokens/s | Cloud cost |
|---|---|---|
| RTX 4090 (24 GB) | ~45 t/s | $0.74/hr on RunPod |
| RTX 3080 (10 GB) | ~22 t/s (Q4 only) | $0.30/hr |
| A100 80GB | ~90 t/s | $1.89/hr on Lambda |
| H100 SXM | ~140 t/s | $2.49/hr on RunPod |
| M3 Max (48 GB unified) | ~35 t/s | local only |
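For budgeting cloud runs, the table converts directly to cost per million output tokens: hourly price divided by tokens generated per hour. A quick check using the table's own numbers:

```python
def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """Dollar cost to generate one million tokens at a given throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1e6

for gpu, price, tps in [("RTX 4090", 0.74, 45), ("A100 80GB", 1.89, 90), ("H100 SXM", 2.49, 140)]:
    print(f"{gpu}: ${cost_per_million_tokens(price, tps):.2f}/Mtok")
```

Note the 4090 and H100 come out nearly even per token: the H100's higher throughput roughly offsets its higher hourly rate.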
## No GPU? Use CPU Offload or Cloud
If you don't have a capable GPU, Ollama can run Scout on CPU with RAM offload — expect 2–4 tokens/s. For Maverick on CPU, performance is impractical. The better option is a cloud GPU at $0.74–$2.49/hr for on-demand inference.
```shell
# Force CPU only: offload zero layers to the GPU
ollama run llama4:scout
# >>> /set parameter num_gpu 0

# Partial offload: keep only 20 layers on the GPU
ollama run llama4:scout
# >>> /set parameter num_gpu 20
```
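To pick a sensible layer count for partial offload, divide free VRAM by the approximate per-layer size of the quantized model. A sketch with hypothetical numbers (the ~9 GB model size matches the Scout Q4 download; the 48-layer count is an assumption for illustration, not Scout's documented architecture):

```python
import math

def layers_that_fit(free_vram_gb: float, model_gb: float, num_layers: int) -> int:
    """How many layers fit in free VRAM, assuming roughly equal layer sizes."""
    per_layer_gb = model_gb / num_layers
    return min(num_layers, math.floor(free_vram_gb / per_layer_gb))

# Hypothetical: 4 GB free VRAM, ~9 GB Q4 model, assumed 48 layers
print(layers_that_fit(4.0, 9.0, 48))  # → 21
```

In practice leave a little VRAM headroom for the KV cache, so round the result down a few layers.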