DeepSeek R1 is a chain-of-thought reasoning model that rivals OpenAI o1 on math and coding benchmarks, and you can run it entirely on your own hardware. The 1.5B and 7B distilled variants run on a laptop CPU. The 32B quantized version fits on a single RTX 3090. Only the full 671B model requires cloud-scale hardware.
## Method 1: Ollama (Easiest)
Ollama is the fastest way to get DeepSeek R1 running. Each size tag pulls a sensible default quantization (typically Q4_K_M), so you don't have to pick one yourself:
```bash
# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Run the 7B model (good CPU performance)
ollama run deepseek-r1:7b

# Run 14B (needs 12 GB VRAM or a fast CPU)
ollama run deepseek-r1:14b

# Run 32B (needs 24 GB+ VRAM)
ollama run deepseek-r1:32b

# List the models you have downloaded; browse all size
# variants at ollama.com/library/deepseek-r1
ollama list
```

DeepSeek R1 uses `<think>` tags to show its reasoning chain before answering. On Ollama, you'll see the reasoning tokens streamed in real time; this is normal and part of the model's design.
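If you call Ollama's REST API instead of the CLI, you may want to strip that reasoning block before using the answer downstream. A minimal sketch using only the standard library, assuming Ollama is serving on its default port 11434:

```python
import json
import re
import urllib.request

def strip_think(text: str) -> str:
    """Remove the <think>...</think> reasoning block, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def ask(prompt: str, model: str = "deepseek-r1:7b") -> str:
    # Non-streaming request to Ollama's local generate endpoint.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return strip_think(json.loads(resp.read())["response"])
```

Keep the reasoning block if you want to inspect the chain of thought; strip it when piping answers into other tools.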
## Method 2: LM Studio (GUI, No Terminal)
LM Studio gives you a ChatGPT-like interface with no terminal required:
```bash
# 1. Download LM Studio from lmstudio.ai
# 2. Open LM Studio → Search tab
# 3. Search "deepseek-r1"
# 4. Download your size variant (7B recommended for laptops)
# 5. Click "Load Model" → Start chatting

# For API access, enable Local Server in LM Studio,
# then point any OpenAI SDK at localhost:1234
```

## Method 3: llama.cpp (Maximum Control)
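The Local Server speaks the OpenAI chat-completions protocol, so any OpenAI-style client works against it. A minimal sketch using only the standard library; the model identifier here is a placeholder, so substitute whatever name LM Studio shows for your loaded model:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "deepseek-r1-distill-qwen-7b") -> dict:
    """OpenAI-style chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base_url: str = "http://localhost:1234/v1") -> str:
    # Requires LM Studio's Local Server to be running on its default port.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, the official `openai` SDK also works: point its `base_url` at `http://localhost:1234/v1` with any placeholder API key.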
For custom quantization, CPU threading control, or server mode:
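Picking a value for `--n-gpu-layers` is mostly arithmetic: divide the model file size by the layer count to get a per-layer cost, then see how many layers fit in free VRAM. A rough sketch, assuming a 48-layer 14B distill and a ~1.5 GB reserve for the KV cache and runtime buffers (both figures are estimates to tune, not exact values):

```python
def layers_to_offload(total_layers: int, model_gb: float, vram_gb: float,
                      reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit on the GPU.

    reserve_gb leaves headroom for the KV cache and runtime buffers;
    round down and experiment from there.
    """
    per_layer_gb = model_gb / total_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(total_layers, fit))
```

With a 6 GB card and the 8.7 GB Q4_K_M file, this suggests offloading about 24 layers; with 24 GB+ the whole model fits on the GPU.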
```bash
# Build llama.cpp (the project now uses CMake)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Download a GGUF model from Hugging Face
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run with partial GPU offload (24 layers on GPU, rest on CPU)
./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  --n-gpu-layers 24 \
  --threads 8 \
  --ctx-size 8192 \
  -p "Solve: what is 17 × 23?"

# Run as an OpenAI-compatible server
./build/bin/llama-server -m models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  --n-gpu-layers 24 --port 8080
```

## Running R1 671B on Cloud (No Local GPU)
The full 671B model needs about 8× H100 80 GB GPUs (640 GB total VRAM), which runs roughly $20–30/hr on cloud providers. Unless you need dedicated hardware, use a serverless host such as Together AI or Fireworks, which serve the full model at per-token rates (on the order of $0.55 per million input tokens) instead of hourly GPU billing.
```bash
# Together AI serves the full R1 671B behind an OpenAI-compatible API
pip install together
```

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational"}],
)
print(response.choices[0].message.content)
```

## Quantization Cheat Sheet
GGUF quantization lets you trade quality for VRAM. For most use cases, Q4_K_M is the sweet spot:
| Quantization | VRAM (14B model) | Quality |
|---|---|---|
| Q2_K | 4.5 GB | Noticeable degradation |
| Q4_K_M | 8.7 GB | Good — recommended |
| Q5_K_M | 10.3 GB | Very good |
| Q8_0 | 15.1 GB | Near-lossless |
| F16 | 28 GB | Full precision |
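The VRAM figures above track file size, which you can approximate from parameter count times bits per weight. A rough sketch; the bits-per-weight values are approximations, since real GGUF files keep some tensors at higher precision:

```python
# Approximate effective bits per weight for common GGUF quant types.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def approx_size_gb(n_params_billion: float, quant: str) -> float:
    """Rough model file size in GB: params * bits / 8, ignoring metadata."""
    bytes_total = n_params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9
```

For a 14B model this gives roughly 8.5 GB at Q4_K_M and 28 GB at F16, in line with the table; budget a couple of extra GB on top for the KV cache at long context sizes.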