Qwen 3 Quick Reference
| Model | Approx. VRAM | Best for |
|---|---|---|
| Qwen3 0.6B | 1 GB | Edge / IoT |
| Qwen3 4B | 3 GB | Laptop |
| Qwen3 8B | 6 GB | Daily driver |
| Qwen3 14B | 10 GB | RTX 3080 |
| Qwen3 32B | 22 GB | RTX 3090/4090 |
| Qwen3 72B | 48 GB | Multi-GPU / Cloud |
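The VRAM column can be sanity-checked with a back-of-envelope rule of thumb (the constants here are rough assumptions, not official figures): a 4-bit quantized model stores about half a byte per parameter, plus roughly 25% on top for KV cache and runtime overhead.

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: float = 0.5,  # ~Q4 quantization
                     overhead: float = 1.25) -> float:
    """Rough VRAM estimate (GB) for a quantized model: weights + ~25% overhead."""
    return params_billions * bytes_per_param * overhead

for size in (0.6, 4, 8, 14, 32):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
```

The estimates land near the table's numbers for the mid-size models; real usage varies with context length and quantization variant.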
Qwen 3 is Alibaba's latest open model family. Qwen3 8B is reported to beat GPT-4o on HumanEval coding benchmarks while running on a single consumer GPU. It supports up to 128K context, 29 languages, and tool calling out of the box.
Step 1: Install Ollama

```shell
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify
ollama --version

# Start the Ollama server (runs in the background automatically on most installs)
ollama serve &
```

Step 2: Pull Your Qwen 3 Variant
```shell
# For most developers: 8B is the sweet spot
ollama pull qwen3:8b

# Pull a specific quantization
ollama pull qwen3:14b-q4_K_M   # ~9 GB VRAM
ollama pull qwen3:14b-q8_0     # ~15 GB VRAM (better quality)

# List what you have downloaded
ollama list
```

Step 3: Run Qwen 3
```shell
# Interactive chat
ollama run qwen3:8b

# Single prompt
ollama run qwen3:8b "Write a Python function to parse CSV files"

# Extended context (requires more VRAM): set num_ctx inside the REPL
ollama run qwen3:14b
# then at the >>> prompt:
# /set parameter num_ctx 32768

# Thinking mode (built-in CoT, like DeepSeek R1)
ollama run qwen3:8b "/think Prove that 0.999... = 1"
```

Using Qwen 3 as an API
```python
# Python with the OpenAI SDK (Ollama exposes an OpenAI-compatible endpoint)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any non-empty string; Ollama ignores it
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a binary search in Python"},
    ],
)
print(response.choices[0].message.content)
```
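The chat endpoint is stateless: for a multi-turn conversation you resend the full message history with every call. A minimal sketch of that bookkeeping (the assistant string here is a stand-in for `response.choices[0].message.content`, built offline):

```python
def append_turn(messages, user_content, assistant_content):
    """Record one completed user/assistant exchange in the history."""
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": assistant_content})
    return messages

history = [{"role": "system", "content": "You are a helpful coding assistant."}]
# After each client.chat.completions.create(...) call, record both sides:
append_turn(history, "Write a binary search in Python",
            "def binary_search(arr, target): ...")  # stand-in for the model's reply
# The next request passes `history` plus the new user message as `messages=`.
```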
Qwen3 supports function calling natively:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
```

Performance vs Cloud Cost
| Setup | Tokens/sec | Cost |
|---|---|---|
| RTX 4090 local (14B) | ~80 t/s | $0 (owned) |
| RTX 4090 RunPod (14B) | ~80 t/s | $0.74/hr |
| A100 Lambda (72B Q4) | ~45 t/s | $1.89/hr |
| CPU only (8B) | ~8 t/s | $0 (laptop) |
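Closing the loop on the tool-calling example above: when the model opts to use the tool, the response carries `tool_calls` entries (with arguments as a JSON string) instead of plain text, and your code runs the function and sends the result back in a `"tool"` role message. A minimal dispatch sketch; the `tool_call` dict is a simulated stand-in for a real SDK object, and `get_weather` is a made-up stub:

```python
import json

def get_weather(city: str) -> str:
    # Stub: a real implementation would query a weather API.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# Simulated shape of one response.choices[0].message.tool_calls entry:
tool_call = {
    "id": "call_0",
    "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'},
}

fn = TOOLS[tool_call["function"]["name"]]
args = json.loads(tool_call["function"]["arguments"])
result = fn(**args)

# Append this to the message history so the model can phrase the final answer:
tool_message = {"role": "tool", "tool_call_id": tool_call["id"], "content": result}
print(tool_message)
```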