Qwen 3 Quick Reference
| Model | Approx. VRAM | Best for |
|---|---|---|
| Qwen3 0.6B | 1 GB | Edge / IoT |
| Qwen3 4B | 3 GB | Laptop |
| Qwen3 8B | 6 GB | Daily driver |
| Qwen3 14B | 10 GB | RTX 3080 |
| Qwen3 32B | 22 GB | RTX 3090/4090 |
| Qwen3 72B | 48 GB | Multi-GPU / Cloud |
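The VRAM column can be sanity-checked with a back-of-envelope rule of thumb (the constants here are rough assumptions, not official figures): a 4-bit quantized model stores about half a byte per parameter, plus roughly 25% on top for KV cache and runtime overhead.

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: float = 0.5,  # ~Q4 quantization
                     overhead: float = 1.25) -> float:
    """Rough VRAM estimate (GB) for a quantized model: weights + ~25% overhead."""
    return params_billions * bytes_per_param * overhead

for size in (0.6, 4, 8, 14, 32):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
```

The estimates land near the table's numbers for the mid-size models; real usage varies with context length and quantization variant.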
Qwen 3 is Alibaba's latest open model family. Qwen3 8B is reported to beat GPT-4o on HumanEval coding benchmarks while running on a single consumer GPU. It supports up to 128K context, 29 languages, and tool calling out of the box.
Step 1: Install Ollama

```shell
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify
ollama --version

# Start the Ollama server (runs in the background automatically on most installs)
ollama serve &
```

Step 2: Pull Your Qwen 3 Variant
```shell
# For most developers: 8B is the sweet spot
ollama pull qwen3:8b

# Pull a specific quantization
ollama pull qwen3:14b-q4_K_M   # ~9 GB VRAM
ollama pull qwen3:14b-q8_0     # ~15 GB VRAM (better quality)

# List what you have downloaded
ollama list
```

Step 3: Run Qwen 3
```shell
# Interactive chat
ollama run qwen3:8b

# Single prompt
ollama run qwen3:8b "Write a Python function to parse CSV files"

# Extended context (requires more VRAM): set num_ctx inside the REPL
ollama run qwen3:14b
# then at the >>> prompt:
# /set parameter num_ctx 32768

# Thinking mode (built-in CoT, like DeepSeek R1)
ollama run qwen3:8b "/think Prove that 0.999... = 1"
```

Using Qwen 3 as an API
```python
# Python with the OpenAI SDK (Ollama exposes an OpenAI-compatible endpoint)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any non-empty string; Ollama ignores it
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a binary search in Python"},
    ],
)
print(response.choices[0].message.content)
```
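The chat endpoint is stateless: for a multi-turn conversation you resend the full message history with every call. A minimal sketch of that bookkeeping (the assistant string here is a stand-in for `response.choices[0].message.content`, built offline):

```python
def append_turn(messages, user_content, assistant_content):
    """Record one completed user/assistant exchange in the history."""
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": assistant_content})
    return messages

history = [{"role": "system", "content": "You are a helpful coding assistant."}]
# After each client.chat.completions.create(...) call, record both sides:
append_turn(history, "Write a binary search in Python",
            "def binary_search(arr, target): ...")  # stand-in for the model's reply
# The next request passes `history` plus the new user message as `messages=`.
```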
Qwen3 supports function calling natively:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
```

Performance vs Cloud Cost
| Setup | Tokens/sec | Cost |
|---|---|---|
| RTX 4090 local (14B) | ~80 t/s | $0 (owned) |
| RTX 4090 RunPod (14B) | ~80 t/s | $0.74/hr |
| A100 Lambda (72B Q4) | ~45 t/s | $1.89/hr |
| CPU only (8B) | ~8 t/s | $0 (laptop) |
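Closing the loop on the tool-calling example above: when the model opts to use the tool, the response carries `tool_calls` entries (with arguments as a JSON string) instead of plain text, and your code runs the function and sends the result back in a `"tool"` role message. A minimal dispatch sketch; the `tool_call` dict is a simulated stand-in for a real SDK object, and `get_weather` is a made-up stub:

```python
import json

def get_weather(city: str) -> str:
    # Stub: a real implementation would query a weather API.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# Simulated shape of one response.choices[0].message.tool_calls entry:
tool_call = {
    "id": "call_0",
    "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'},
}

fn = TOOLS[tool_call["function"]["name"]]
args = json.loads(tool_call["function"]["arguments"])
result = fn(**args)

# Append this to the message history so the model can phrase the final answer:
tool_message = {"role": "tool", "tool_call_id": tool_call["id"], "content": result}
print(tool_message)
```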