Mistral AI has the most practical open-weight model family for local deployment. Mistral 7B runs on a laptop with a 6 GB GPU. Mixtral 8×7B uses a sparse mixture-of-experts (MoE) architecture: 47B parameters in total, but only about 13B are active per token, so it decodes roughly as fast as a dense 13B model despite its size. This guide covers three practical ways to run these models locally.
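The total-vs-active distinction matters for hardware planning: all 47B weights must be resident in memory, while per-token compute tracks only the 13B active parameters. A rough back-of-envelope sketch (bytes-per-parameter figures are approximate, and real deployments add KV-cache and runtime overhead on top):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone (no KV cache or overhead)."""
    return params_billion * bytes_per_param

# Mixtral 8x7B: ~47B total parameters, ~13B active per token.
# Memory is driven by the TOTAL count -- every expert must be loaded.
print(f"fp16:  ~{weight_memory_gb(47, 2.0):.0f} GB")   # far beyond a single consumer GPU
print(f"4-bit: ~{weight_memory_gb(47, 0.5):.0f} GB")   # close to the 26 GB figure below
# Per-token compute, by contrast, scales with the ~13B active parameters,
# which is why Mixtral feels like a 13B model at inference time.
```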
Method 1: Ollama (Recommended)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Mistral 7B — runs on anything with 6 GB VRAM
ollama run mistral
# Mistral Small (22B) — better quality
ollama run mistral-small
# Mixtral 8x7B — MoE, needs 26 GB VRAM
ollama run mixtral
# Mistral Nemo (12B, multilingual)
ollama run mistral-nemo
# Codestral (coding specialist)
ollama run codestral

Method 2: vLLM (Production Inference Server)
For serving Mistral to multiple users or building an API backend, vLLM offers continuous batching and much higher throughput than Ollama:
pip install vllm
# Serve Mistral 7B with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --port 8000 \
    --max-model-len 32768
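Because vLLM exposes an OpenAI-compatible endpoint, any OpenAI-style client can talk to it. A minimal stdlib-only sketch, assuming the server above is running on port 8000 (the model name must match the one you served):

```python
import json
from urllib import request

VLLM_URL = "http://localhost:8000/v1"  # matches --port above

def build_payload(prompt: str,
                  model: str = "mistralai/Mistral-7B-Instruct-v0.3") -> dict:
    """OpenAI-style chat payload; vLLM accepts the same schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str) -> str:
    """POST one chat turn to the vLLM server and return the reply text."""
    req = request.Request(
        f"{VLLM_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Explain continuous batching in one sentence.")  # needs the server running
```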
# Serve Mixtral 8x7B (needs 2× A100 or equivalent)
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --port 8000

Method 3: LM Studio (No Terminal)
# 1. Download LM Studio from lmstudio.ai
# 2. Open → Discover → search "mistral"
# 3. Pick model size based on your VRAM
# 4. Click Load → Start chatting
#
# For the API:
# Enable "Local Server" in LM Studio settings
# Then use the OpenAI SDK against http://localhost:1234
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lmstudio")
response = client.chat.completions.create(
    model="mistral-7b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

Mistral vs Alternatives: When to Pick Each
| Use Case | Best Model | Why |
|---|---|---|
| General chat | Mistral 7B | Fast, runs on anything |
| Code generation | Codestral 22B | Trained on code, beats 7B by a wide margin |
| Multilingual | Mistral Nemo | Best multilingual in 12B class |
| High quality (local) | Mixtral 8x7B | Near-GPT-3.5 quality at 26 GB VRAM |
| Production API | Mistral Large 2 | Top-tier quality; use Mistral's cloud API for cost |
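The table above can be collapsed into a simple picker. The 6 GB and 26 GB thresholds come from this guide; the intermediate cutoffs are my rough assumptions for ~4-bit quantized weights, not hard requirements:

```python
def pick_mistral_model(vram_gb: float, task: str = "chat") -> str:
    """Suggest a local Mistral-family Ollama tag from VRAM and task.

    Thresholds are approximate and assume ~4-bit quantized weights.
    """
    if task == "code" and vram_gb >= 14:       # 14 GB cutoff is an assumption
        return "codestral"       # 22B coding specialist
    if task == "multilingual" and vram_gb >= 8:
        return "mistral-nemo"    # 12B, strong multilingual
    if vram_gb >= 26:
        return "mixtral"         # 8x7B MoE, highest local quality
    if vram_gb >= 14:
        return "mistral-small"   # 22B, better quality than 7B
    return "mistral"             # 7B, runs on ~6 GB

print(pick_mistral_model(24, "code"))  # codestral
print(pick_mistral_model(6))           # mistral
```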
Cloud Option for Mixtral 8×22B
Mixtral 8×22B needs 70 GB VRAM. That's 2× A100 80GB at ~$3.78/hr on Lambda Labs, or a single H100 80GB at $2.49/hr on RunPod with quantization. For occasional inference, use Together AI or Fireworks for pay-per-token.
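A quick break-even sketch helps with the rent-vs-pay-per-token decision. The $3.78/hr figure is from above; the per-million-token price is a placeholder you would substitute from the provider's pricing page:

```python
def breakeven_tokens_per_hour(gpu_cost_per_hr: float,
                              price_per_million_tokens: float) -> float:
    """Tokens/hour you must sustain before a rented GPU beats pay-per-token."""
    return gpu_cost_per_hr / price_per_million_tokens * 1_000_000

# 2x A100 80GB at ~$3.78/hr (Lambda Labs figure from this guide),
# vs a HYPOTHETICAL $1.20 per million tokens at a serverless provider.
rate = breakeven_tokens_per_hour(3.78, 1.20)
print(f"break-even: ~{rate:,.0f} tokens/hour")  # ~3,150,000
```

Below that sustained throughput, pay-per-token wins; above it, the rented GPUs are cheaper (ignoring idle time, which usually tips occasional workloads toward pay-per-token).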