DeepSeek R1 is a chain-of-thought reasoning model that rivals OpenAI o1 on math and coding benchmarks, and you can run it entirely on your own hardware. The 1.5B and 7B distilled variants run on a laptop CPU. The 32B quantized version fits on a single RTX 3090. Only the full 671B model requires cloud-scale hardware.
## Method 1: Ollama (Easiest)
Ollama is the fastest way to get DeepSeek R1 running. Each size tag pulls a sensible default quantization (typically Q4_K_M), so you don't have to pick one yourself:
```bash
# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Run the 7B model (good CPU performance)
ollama run deepseek-r1:7b

# Run 14B (needs 12 GB VRAM or a fast CPU)
ollama run deepseek-r1:14b

# Run 32B (needs 24 GB+ VRAM)
ollama run deepseek-r1:32b

# List the models you have downloaded; browse all size
# variants at ollama.com/library/deepseek-r1
ollama list
```

DeepSeek R1 uses `<think>` tags to show its reasoning chain before answering. On Ollama, you'll see the reasoning tokens streamed in real time; this is normal and part of the model's design.
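If you call Ollama's REST API instead of the CLI, you may want to strip that reasoning block before using the answer downstream. A minimal sketch using only the standard library, assuming Ollama is serving on its default port 11434:

```python
import json
import re
import urllib.request

def strip_think(text: str) -> str:
    """Remove the <think>...</think> reasoning block, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def ask(prompt: str, model: str = "deepseek-r1:7b") -> str:
    # Non-streaming request to Ollama's local generate endpoint.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return strip_think(json.loads(resp.read())["response"])
```

Keep the reasoning block if you want to inspect the chain of thought; strip it when piping answers into other tools.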
## Method 2: LM Studio (GUI, No Terminal)
LM Studio gives you a ChatGPT-like interface with no terminal required:
```bash
# 1. Download LM Studio from lmstudio.ai
# 2. Open LM Studio → Search tab
# 3. Search "deepseek-r1"
# 4. Download your size variant (7B recommended for laptops)
# 5. Click "Load Model" → Start chatting

# For API access, enable Local Server in LM Studio,
# then point any OpenAI SDK at localhost:1234
```

## Method 3: llama.cpp (Maximum Control)
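The Local Server speaks the OpenAI chat-completions protocol, so any OpenAI-style client works against it. A minimal sketch using only the standard library; the model identifier here is a placeholder, so substitute whatever name LM Studio shows for your loaded model:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "deepseek-r1-distill-qwen-7b") -> dict:
    """OpenAI-style chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base_url: str = "http://localhost:1234/v1") -> str:
    # Requires LM Studio's Local Server to be running on its default port.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, the official `openai` SDK also works: point its `base_url` at `http://localhost:1234/v1` with any placeholder API key.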
For custom quantization, CPU threading control, or server mode:
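Picking a value for `--n-gpu-layers` is mostly arithmetic: divide the model file size by the layer count to get a per-layer cost, then see how many layers fit in free VRAM. A rough sketch, assuming a 48-layer 14B distill and a ~1.5 GB reserve for the KV cache and runtime buffers (both figures are estimates to tune, not exact values):

```python
def layers_to_offload(total_layers: int, model_gb: float, vram_gb: float,
                      reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit on the GPU.

    reserve_gb leaves headroom for the KV cache and runtime buffers;
    round down and experiment from there.
    """
    per_layer_gb = model_gb / total_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(total_layers, fit))
```

With a 6 GB card and the 8.7 GB Q4_K_M file, this suggests offloading about 24 layers; with 24 GB+ the whole model fits on the GPU.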
```bash
# Build llama.cpp (the project now uses CMake)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Download a GGUF model from Hugging Face
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run with partial GPU offload (24 layers on GPU, rest on CPU)
./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  --n-gpu-layers 24 \
  --threads 8 \
  --ctx-size 8192 \
  -p "Solve: what is 17 × 23?"

# Run as an OpenAI-compatible server
./build/bin/llama-server -m models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  --n-gpu-layers 24 --port 8080
```

## Running R1 671B on Cloud (No Local GPU)
The full 671B model needs about 8× H100 80 GB GPUs (640 GB total VRAM), which runs roughly $20–30/hr on cloud providers. Unless you need dedicated hardware, use a serverless host such as Together AI or Fireworks, which serve the full model at per-token rates (on the order of $0.55 per million input tokens) instead of hourly GPU billing.
```bash
# Together AI serves the full R1 671B behind an OpenAI-compatible API
pip install together
```

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational"}],
)
print(response.choices[0].message.content)
```

## Quantization Cheat Sheet
GGUF quantization lets you trade quality for VRAM. For most use cases, Q4_K_M is the sweet spot:
| Quantization | VRAM (14B model) | Quality |
|---|---|---|
| Q2_K | 4.5 GB | Noticeable degradation |
| Q4_K_M | 8.7 GB | Good — recommended |
| Q5_K_M | 10.3 GB | Very good |
| Q8_0 | 15.1 GB | Near-lossless |
| F16 | 28 GB | Full precision |
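The VRAM figures above track file size, which you can approximate from parameter count times bits per weight. A rough sketch; the bits-per-weight values are approximations, since real GGUF files keep some tensors at higher precision:

```python
# Approximate effective bits per weight for common GGUF quant types.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def approx_size_gb(n_params_billion: float, quant: str) -> float:
    """Rough model file size in GB: params * bits / 8, ignoring metadata."""
    bytes_total = n_params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9
```

For a 14B model this gives roughly 8.5 GB at Q4_K_M and 28 GB at F16, in line with the table; budget a couple of extra GB on top for the KV cache at long context sizes.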