Meta released Llama 4 with two open-weight variants: Scout (17B active parameters, a 16-expert MoE) and Maverick (400B total parameters, 17B active per token). Scout runs comfortably on a single RTX 4090 at 4-bit quantization. Maverick needs a multi-GPU rig or a cloud instance.
This guide walks through running both models locally with Ollama — the easiest path for most developers. If your GPU doesn't have enough VRAM, we also cover CPU offload and cloud alternatives.
## Requirements
| Component | Scout (min) | Maverick (min) |
|---|---|---|
| VRAM | 8 GB (Q4) | 80 GB (Q4, multi-GPU) |
| RAM | 16 GB | 64 GB |
| GPU | RTX 3080 / 4070 | 2× A100 80GB or H100 |
| Storage | 10 GB free | 120 GB free |
| OS | macOS / Linux / Windows | Linux recommended |
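As a rule of thumb, a quantized model's weight file takes roughly parameter count × bits per weight ÷ 8 bytes, plus a few GB of headroom for the KV cache and runtime. A minimal sketch of that arithmetic (the bits-per-weight figures are approximations for common GGUF quant types, not exact values):

```python
# Approximate bits per weight for common quantization formats (assumed values)
BITS_PER_WEIGHT = {"q4_k_m": 4.5, "q8_0": 8.5, "f16": 16.0}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Rough weight-file size in GB for a given parameter count and quant type."""
    bits = BITS_PER_WEIGHT[quant.lower()]
    return params_billion * 1e9 * bits / 8 / 1e9

# 17B parameters at ~4.5 bits/weight lands near the ~9 GB Scout download
print(f"{model_size_gb(17, 'q4_k_m'):.1f} GB")  # → 9.6 GB
```

Add 2–4 GB on top of this for context and overhead when sizing VRAM.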
## Step 1: Install Ollama
Ollama is a single-binary tool that handles model downloads, quantization selection, and serving. Install it with one command:
```shell
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from https://ollama.com/download

# Verify the installation
ollama --version
```

## Step 2: Pull and Run Llama 4 Scout
Scout is the practical choice for most local setups. The default pull uses Q4_K_M quantization (~9 GB):
```shell
# Pull and run Scout (interactive chat)
ollama run llama4:scout

# Or pull first, run later
ollama pull llama4:scout
ollama run llama4:scout "Explain transformers in simple terms"

# For higher quality (needs 12 GB VRAM):
ollama run llama4:scout-q8
```

## Step 3: Run Llama 4 Maverick (Multi-GPU)
Maverick requires multi-GPU or a cloud instance. Ollama automatically distributes layers across available GPUs:
```shell
# Make both GPUs visible to the Ollama server
export CUDA_VISIBLE_DEVICES=0,1

# Pull Maverick (Q4, ~100 GB download)
ollama pull llama4:maverick

# Run, then raise the context window from inside the chat session:
ollama run llama4:maverick
# >>> /set parameter num_ctx 32768
```

## Step 4: Use the REST API
Ollama serves its native REST API on port 11434 and also exposes an OpenAI-compatible endpoint at `/v1`, so you can use it with any OpenAI SDK:
```shell
# Direct API call (native endpoint; "stream": false returns one JSON object
# instead of a stream of newline-delimited chunks)
curl http://localhost:11434/api/generate \
  -d '{"model":"llama4:scout","prompt":"Hello!","stream":false}'
```
# OpenAI SDK (Python)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama4:scout",
messages=[{"role":"user","content":"Hello!"}]
)
print(response.choices[0].message.content)Performance Benchmarks
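Throughput figures like those in the table below can be computed from the native API response itself, which reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). A small helper, shown with a made-up sample response rather than a live call:

```python
def tokens_per_second(resp: dict) -> float:
    """Decode speed from an Ollama /api/generate (non-streaming) response."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Hypothetical response fields: 450 tokens generated in 10 seconds
sample = {"eval_count": 450, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.0f} t/s")  # → 45 t/s
```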
| GPU | Scout tokens/s | Cloud cost |
|---|---|---|
| RTX 4090 (24 GB) | ~45 t/s | $0.74/hr on RunPod |
| RTX 3080 (10 GB) | ~22 t/s (Q4 only) | $0.30/hr |
| A100 80GB | ~90 t/s | $1.89/hr on Lambda |
| H100 SXM | ~140 t/s | $2.49/hr on RunPod |
| M3 Max (48 GB unified) | ~35 t/s | local only |
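For budgeting cloud runs, the table converts directly to cost per million output tokens: hourly price divided by tokens generated per hour. A quick check using the table's own numbers:

```python
def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """Dollar cost to generate one million tokens at a given throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1e6

for gpu, price, tps in [("RTX 4090", 0.74, 45), ("A100 80GB", 1.89, 90), ("H100 SXM", 2.49, 140)]:
    print(f"{gpu}: ${cost_per_million_tokens(price, tps):.2f}/Mtok")
```

Note the 4090 and H100 come out nearly even per token: the H100's higher throughput roughly offsets its higher hourly rate.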
## No GPU? Use CPU Offload or Cloud
If you don't have a capable GPU, Ollama can run Scout on CPU with RAM offload — expect 2–4 tokens/s. For Maverick on CPU, performance is impractical. The better option is a cloud GPU at $0.74–$2.49/hr for on-demand inference.
```shell
# Force CPU only: offload zero layers to the GPU
ollama run llama4:scout
# >>> /set parameter num_gpu 0

# Partial offload: keep only 20 layers on the GPU
ollama run llama4:scout
# >>> /set parameter num_gpu 20
```
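To pick a sensible layer count for partial offload, divide free VRAM by the approximate per-layer size of the quantized model. A sketch with hypothetical numbers (the ~9 GB model size matches the Scout Q4 download; the 48-layer count is an assumption for illustration, not Scout's documented architecture):

```python
import math

def layers_that_fit(free_vram_gb: float, model_gb: float, num_layers: int) -> int:
    """How many layers fit in free VRAM, assuming roughly equal layer sizes."""
    per_layer_gb = model_gb / num_layers
    return min(num_layers, math.floor(free_vram_gb / per_layer_gb))

# Hypothetical: 4 GB free VRAM, ~9 GB Q4 model, assumed 48 layers
print(layers_that_fit(4.0, 9.0, 48))  # → 21
```

In practice leave a little VRAM headroom for the KV cache, so round the result down a few layers.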