How to Run DeepSeek R1 Locally (No GPU Required)

Run DeepSeek R1 on your machine with Ollama, LM Studio, or llama.cpp. Quantization guide and cloud API fallback.

April 10, 2026 · 8 min read
DeepSeek R1 Variants — Pick Your Hardware

| Variant | Size | Hardware | Speed |
|---|---|---|---|
| R1 1.5B | 2 GB | Any modern laptop (CPU) | Fast |
| R1 7B | 5 GB | GTX 1660 / M1 Mac | Good |
| R1 14B | 9 GB | RTX 3080 / M2 Pro | Good |
| R1 32B | 20 GB | RTX 3090 / A10G | Moderate |
| R1 70B | 40 GB | 2× RTX 3090 or A100 | Slow locally |
| R1 671B | 400 GB | Cloud only (H100 ×8) | Cloud |

DeepSeek R1 is a chain-of-thought reasoning model that rivals OpenAI o1 on math and coding benchmarks — and you can run it entirely on your own hardware. The 1.5B and 7B distilled variants run on a laptop CPU. The 32B quantized version fits on a single RTX 3090. Only the full 671B requires cloud.
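To make the table concrete, here is a quick back-of-the-envelope helper for picking a variant from available VRAM. This is a sketch based on the sizes above, not an official sizing tool; the thresholds are approximate and leave no headroom for context:

# Hypothetical variant picker; thresholds mirror the table above (approximate).
VARIANTS = [
    ("deepseek-r1:70b", 40),
    ("deepseek-r1:32b", 20),
    ("deepseek-r1:14b", 9),
    ("deepseek-r1:7b", 5),
    ("deepseek-r1:1.5b", 2),
]

def pick_variant(vram_gb: float) -> str:
    """Return the largest variant whose approximate footprint fits in vram_gb."""
    for name, gb in VARIANTS:
        if gb <= vram_gb:
            return name
    return "deepseek-r1:1.5b"  # small enough to run on CPU

print(pick_variant(24))  # RTX 3090/4090 -> deepseek-r1:32b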

Method 1: Ollama (Easiest)

Ollama is the fastest way to get DeepSeek R1 running. It handles the download, GPU detection, and layer offloading automatically, and its default tags ship a 4-bit quantized build:

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Run the 7B model (good CPU performance)
ollama run deepseek-r1:7b

# Run 14B (needs 12 GB VRAM or fast CPU)
ollama run deepseek-r1:14b

# Run 32B (needs 24 GB+ VRAM)
ollama run deepseek-r1:32b

# List models you've already downloaded
ollama list
# Browse all size variants at https://ollama.com/library/deepseek-r1

DeepSeek R1 uses <think> tags to show its reasoning chain before answering. On Ollama, you'll see the reasoning tokens streamed in real time — this is normal and part of the model's design.
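Ollama also exposes a local HTTP API on port 11434, so you can call the model from code instead of the interactive prompt. A minimal Python sketch (assumes the 7B model is already pulled; stripping the <think> block with a regex is just one way to separate the reasoning from the final answer):

import re
import requests

# Ollama's HTTP API listens on localhost:11434 by default.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:7b",
        "messages": [{"role": "user", "content": "What is 17 x 23?"}],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
content = resp.json()["message"]["content"]

# R1 wraps its reasoning chain in <think>...</think>; keep only the answer.
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
print(answer)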

Method 2: LM Studio (GUI, No Terminal)

LM Studio gives you a ChatGPT-like interface with no terminal required:

# 1. Download LM Studio from lmstudio.ai
# 2. Open LM Studio → Search tab
# 3. Search "deepseek-r1"
# 4. Download your size variant (7B recommended for laptops)
# 5. Click "Load Model" → Start chatting

# For API access, enable Local Server in LM Studio
# Then use OpenAI SDK pointing to localhost:1234
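Once the Local Server is running, anything that speaks the OpenAI API can talk to it. A minimal sketch using the official openai Python package; the model name below is a placeholder and must match whatever you loaded in LM Studio (the API key is ignored, but the field is required):

from openai import OpenAI

# LM Studio's local server speaks the OpenAI protocol on port 1234 by default.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",  # placeholder: use your loaded model's name
    messages=[{"role": "user", "content": "What is 17 x 23?"}],
)
print(response.choices[0].message.content)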

Method 3: llama.cpp (Maximum Control)

For custom quantization, CPU threading control, or server mode:

# Build llama.cpp (CMake is the supported build system)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build            # add -DGGML_CUDA=ON for NVIDIA GPUs
cmake --build build --config Release -j

# Download a GGUF model from Hugging Face
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run with partial GPU offload (24 layers on GPU, rest on CPU)
./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  --n-gpu-layers 24 \
  --threads 8 \
  --ctx-size 8192 \
  -p "Solve: what is 17 × 23?"

# Run as an OpenAI-compatible server
./build/bin/llama-server -m models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  --n-gpu-layers 24 --port 8080
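llama-server implements the OpenAI chat completions route under /v1, so the same client pattern from the LM Studio section works here. A minimal sketch; with a single loaded model, the model field is essentially informational:

from openai import OpenAI

# llama-server exposes OpenAI-compatible endpoints on the port chosen above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="deepseek-r1-14b",  # informational: the server has one model loaded
    messages=[{"role": "user", "content": "Solve: what is 17 × 23?"}],
)
print(response.choices[0].message.content)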

Running R1 671B on Cloud (No Local GPU)

The full 671B model needs 8× H100 80GB GPUs (640 GB total VRAM). That's ~$20–30/hr on cloud. Use it via Together AI or Fireworks for serverless inference at $0.55/M tokens instead of running it yourself.

# Together AI: full R1 671B behind an OpenAI-style API
# Install the SDK first: pip install together
# Set TOGETHER_API_KEY in your environment before running.

from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational"}],
)
print(response.choices[0].message.content)
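R1's reasoning chains can run to thousands of tokens, so streaming usually gives a much better experience than waiting on the full response. A sketch of the same call with stream=True, assuming the Together SDK's OpenAI-style delta chunks:

from together import Together

client = Together()
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational"}],
    stream=True,  # yield chunks as tokens arrive instead of one final object
)
for chunk in stream:
    # Each chunk carries an incremental OpenAI-style delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)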

Quantization Cheat Sheet

GGUF quantization lets you trade quality for VRAM. For most use cases, Q4_K_M is the sweet spot:

| Quantization | VRAM (14B model) | Quality |
|---|---|---|
| Q2_K | 4.5 GB | Noticeable degradation |
| Q4_K_M | 8.7 GB | Good (recommended) |
| Q5_K_M | 10.3 GB | Very good |
| Q8_0 | 15.1 GB | Near-lossless |
| F16 | 28 GB | Full precision |
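These numbers scale roughly linearly with parameter count: footprint ≈ parameters × bits-per-weight / 8, plus overhead for context and buffers. A back-of-the-envelope sketch; the effective bits-per-weight values and the 10% overhead factor are approximations, not measured constants:

# Rough GGUF footprint estimate: params * bits_per_weight / 8, plus overhead.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def estimate_gb(params_billion: float, quant: str, overhead: float = 1.10) -> float:
    """Approximate file/VRAM footprint in GB for a given quantization."""
    weight_bytes = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes * overhead / 1e9

print(f"{estimate_gb(14, 'Q4_K_M'):.1f} GB")  # ~9 GB, close to the table's 8.7 GB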
