
How to Run Mistral Models Locally

Run Mistral 7B, Mixtral 8x7B, and Codestral locally via Ollama and vLLM. Performance benchmarks on CPU and GPU.

April 10, 2026 · 7 min read
Mistral Model Family — Local Requirements

Model                 | VRAM (quantized) | Notes
Mistral 7B v0.3       | 5 GB             | Daily driver
Mistral Small 3 (24B) | 14 GB            | Balanced quality
Mixtral 8×7B (MoE)    | 26 GB            | High quality
Mixtral 8×22B (MoE)   | 70 GB            | Cloud recommended
Mistral Large 2       | 123 GB           | Cloud only

Mistral AI has one of the most practical open-weight model families for local deployment. Mistral 7B runs on a laptop with a 6 GB GPU. Mixtral 8×7B uses a sparse MoE architecture: it has 47B parameters total but only activates 13B per token, making it fast despite its size. This guide covers three ways to run these models locally.
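The VRAM figures above follow from simple arithmetic: quantized weights take roughly (parameters × bits ÷ 8) bytes, plus runtime overhead for the KV cache and buffers. A rough sketch — the 20% overhead factor is an assumption, not a measured value:

```python
def est_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate for a quantized model.

    params_b: parameter count in billions; bits: quantization width.
    The 1.2 overhead factor (KV cache, buffers) is an assumption.
    """
    weight_gb = params_b * bits / 8  # billions of params -> GB of weights
    return weight_gb * overhead

# Mixtral 8x7B: 47B total params at 4-bit quantization
print(round(est_vram_gb(47), 1))  # → 28.2, in the ballpark of the 26 GB above
```

The same formula explains why the MoE models are memory-hungry despite fast inference: all 47B parameters must sit in VRAM even though only 13B are active per token.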

Method 1: Ollama (Recommended)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Mistral 7B — runs on anything with 6 GB VRAM
ollama run mistral

# Mistral Small 3 (24B) — better quality
ollama run mistral-small

# Mixtral 8x7B — MoE, needs 26 GB VRAM
ollama run mixtral

# Mistral Nemo (12B, multilingual)
ollama run mistral-nemo

# Codestral (coding specialist)
ollama run codestral
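Beyond the interactive `ollama run` prompt, Ollama also serves an HTTP API on port 11434. A minimal stdlib-only sketch — the model name and prompt here are just examples:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False asks Ollama for one JSON object instead of chunked output
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask_ollama("mistral", "Explain mixture-of-experts in one sentence."))
```

Any model pulled with `ollama run` is addressable by the same name through this API.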

Method 2: vLLM (Production Inference Server)

For serving Mistral to multiple users or building an API backend, vLLM offers continuous batching and much higher throughput than Ollama:

pip install vllm

# Serve Mistral 7B with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --port 8000 \
  --max-model-len 32768

# Serve Mixtral 8x7B (needs 2× A100 or equivalent)
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --port 8000

Method 3: LM Studio (No Terminal)

# 1. Download LM Studio from lmstudio.ai
# 2. Open → Discover → search "mistral"
# 3. Pick model size based on your VRAM
# 4. Click Load → Start chatting
#
# For the API:
# Enable "Local Server" in LM Studio settings
# Then use the OpenAI SDK against http://localhost:1234

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lmstudio")
response = client.chat.completions.create(
    model="mistral-7b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Mistral vs Alternatives: When to Pick Each

Use Case             | Best Model      | Why
General chat         | Mistral 7B      | Fast, runs on anything
Code generation      | Codestral 22B   | Trained on code, beats 7B by a wide margin
Multilingual         | Mistral Nemo    | Best multilingual in the 12B class
High quality (local) | Mixtral 8x7B    | Near-GPT-3.5 quality at 26 GB VRAM
Production API       | Mistral Large 2 | Top-tier; use Mistral's cloud for cost

Cloud Option for Mixtral 8×22B

Mixtral 8×22B needs 70 GB VRAM. That's 2× A100 80GB at ~$3.78/hr on Lambda Labs, or a single H100 80GB at $2.49/hr on RunPod with quantization. For occasional inference, use Together AI or Fireworks for pay-per-token.
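Whether renting beats pay-per-token comes down to sustained throughput. A quick break-even sketch using the $3.78/hr figure above — the $0.90 per million tokens rate is a placeholder assumption for illustration, not a quoted price:

```python
HOURLY_RATE = 3.78   # USD/hr for 2x A100 80GB (figure from the text)
PER_MTOKEN = 0.90    # USD per 1M tokens -- assumed rate, check current pricing

def breakeven_tokens_per_hour(hourly: float = HOURLY_RATE,
                              per_mtok: float = PER_MTOKEN) -> float:
    """Tokens/hour you must sustain before renting beats pay-per-token."""
    return hourly / per_mtok * 1_000_000

print(f"{breakeven_tokens_per_hour():,.0f} tokens/hour")  # → 4,200,000 tokens/hour
```

Below that sustained rate, per-token APIs like Together AI or Fireworks are cheaper; above it, a dedicated rental wins.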
