Mistral AI has the most practical open-weight model family for local deployment. Mistral 7B runs on a laptop with a 6 GB GPU. Mixtral 8×7B uses a sparse mixture-of-experts (MoE) architecture: 47B parameters in total, but only about 13B are active per token, so it decodes roughly as fast as a dense 13B model despite its size. This guide covers three practical ways to run these models locally.
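The total-vs-active distinction matters for hardware planning: all 47B weights must be resident in memory, while per-token compute tracks only the 13B active parameters. A rough back-of-envelope sketch (bytes-per-parameter figures are approximate, and real deployments add KV-cache and runtime overhead on top):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone (no KV cache or overhead)."""
    return params_billion * bytes_per_param

# Mixtral 8x7B: ~47B total parameters, ~13B active per token.
# Memory is driven by the TOTAL count -- every expert must be loaded.
print(f"fp16:  ~{weight_memory_gb(47, 2.0):.0f} GB")   # far beyond a single consumer GPU
print(f"4-bit: ~{weight_memory_gb(47, 0.5):.0f} GB")   # close to the 26 GB figure below
# Per-token compute, by contrast, scales with the ~13B active parameters,
# which is why Mixtral feels like a 13B model at inference time.
```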
Method 1: Ollama (Recommended)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Mistral 7B — runs on anything with 6 GB VRAM
ollama run mistral
# Mistral Small (22B) — better quality
ollama run mistral-small
# Mixtral 8x7B — MoE, needs 26 GB VRAM
ollama run mixtral
# Mistral Nemo (12B, multilingual)
ollama run mistral-nemo
# Codestral (coding specialist)
ollama run codestral

Method 2: vLLM (Production Inference Server)
For serving Mistral to multiple users or building an API backend, vLLM offers continuous batching and much higher throughput than Ollama:
pip install vllm
# Serve Mistral 7B with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --port 8000 \
    --max-model-len 32768
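Because vLLM exposes an OpenAI-compatible endpoint, any OpenAI-style client can talk to it. A minimal stdlib-only sketch, assuming the server above is running on port 8000 (the model name must match the one you served):

```python
import json
from urllib import request

VLLM_URL = "http://localhost:8000/v1"  # matches --port above

def build_payload(prompt: str,
                  model: str = "mistralai/Mistral-7B-Instruct-v0.3") -> dict:
    """OpenAI-style chat payload; vLLM accepts the same schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str) -> str:
    """POST one chat turn to the vLLM server and return the reply text."""
    req = request.Request(
        f"{VLLM_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Explain continuous batching in one sentence.")  # needs the server running
```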
# Serve Mixtral 8x7B (needs 2× A100 or equivalent)
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --port 8000

Method 3: LM Studio (No Terminal)
# 1. Download LM Studio from lmstudio.ai
# 2. Open → Discover → search "mistral"
# 3. Pick model size based on your VRAM
# 4. Click Load → Start chatting
#
# For the API:
# Enable "Local Server" in LM Studio settings
# Then use the OpenAI SDK against http://localhost:1234
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lmstudio")
response = client.chat.completions.create(
    model="mistral-7b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

Mistral vs Alternatives: When to Pick Each
| Use Case | Best Model | Why |
|---|---|---|
| General chat | Mistral 7B | Fast, runs on anything |
| Code generation | Codestral 22B | Trained on code, beats 7B by a wide margin |
| Multilingual | Mistral Nemo | Best multilingual in 12B class |
| High quality (local) | Mixtral 8x7B | Near-GPT-3.5 quality at 26 GB VRAM |
| Production API | Mistral Large 2 | Top-tier quality; use Mistral's cloud API for cost |
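The table above can be collapsed into a simple picker. The 6 GB and 26 GB thresholds come from this guide; the intermediate cutoffs are my rough assumptions for ~4-bit quantized weights, not hard requirements:

```python
def pick_mistral_model(vram_gb: float, task: str = "chat") -> str:
    """Suggest a local Mistral-family Ollama tag from VRAM and task.

    Thresholds are approximate and assume ~4-bit quantized weights.
    """
    if task == "code" and vram_gb >= 14:       # 14 GB cutoff is an assumption
        return "codestral"       # 22B coding specialist
    if task == "multilingual" and vram_gb >= 8:
        return "mistral-nemo"    # 12B, strong multilingual
    if vram_gb >= 26:
        return "mixtral"         # 8x7B MoE, highest local quality
    if vram_gb >= 14:
        return "mistral-small"   # 22B, better quality than 7B
    return "mistral"             # 7B, runs on ~6 GB

print(pick_mistral_model(24, "code"))  # codestral
print(pick_mistral_model(6))           # mistral
```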
Cloud Option for Mixtral 8×22B
Mixtral 8×22B needs 70 GB VRAM. That's 2× A100 80GB at ~$3.78/hr on Lambda Labs, or a single H100 80GB at $2.49/hr on RunPod with quantization. For occasional inference, use Together AI or Fireworks for pay-per-token.
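A quick break-even sketch helps with the rent-vs-pay-per-token decision. The $3.78/hr figure is from above; the per-million-token price is a placeholder you would substitute from the provider's pricing page:

```python
def breakeven_tokens_per_hour(gpu_cost_per_hr: float,
                              price_per_million_tokens: float) -> float:
    """Tokens/hour you must sustain before a rented GPU beats pay-per-token."""
    return gpu_cost_per_hr / price_per_million_tokens * 1_000_000

# 2x A100 80GB at ~$3.78/hr (Lambda Labs figure from this guide),
# vs a HYPOTHETICAL $1.20 per million tokens at a serverless provider.
rate = breakeven_tokens_per_hour(3.78, 1.20)
print(f"break-even: ~{rate:,.0f} tokens/hour")  # ~3,150,000
```

Below that sustained throughput, pay-per-token wins; above it, the rented GPUs are cheaper (ignoring idle time, which usually tips occasional workloads toward pay-per-token).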