Gemma 4 is Google's fourth-generation open model family. Unlike Gemma 3, Gemma 4 supports text, images, and audio in the 12B and 27B variants. The 4B text-only version runs on any modern laptop. All weights are available on Hugging Face under a permissive license.
## Requirements
| Model | Min VRAM | Min RAM | Recommended GPU |
|---|---|---|---|
| Gemma 4 4B | 4 GB | 8 GB | GTX 1650 / CPU |
| Gemma 4 12B | 10 GB | 16 GB | RTX 3080 Ti / 4070 |
| Gemma 4 27B | 20 GB | 32 GB | RTX 3090 / 4090 |
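The minimums above assume quantized weights. As a rough rule of thumb, weight memory ≈ parameter count × bytes per parameter; the KV cache and activations add overhead on top of that. A quick back-of-the-envelope sketch (weight-only, using 1 GB = 1e9 bytes):

```python
def estimate_weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only memory footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

# Gemma 4 27B at different precisions (weights only, before any overhead)
for bits, name in [(16, "bf16"), (8, "int8"), (4, "Q4")]:
    print(f"27B @ {name}: ~{estimate_weight_gb(27, bits):.1f} GB")
# bf16 ≈ 54 GB, int8 ≈ 27 GB, Q4 ≈ 13.5 GB
```

This is why the 20 GB minimum for the 27B implies 4-bit quantization: the bf16 weights alone are more than double that.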
## Method 1: Ollama
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Gemma 4 4B (text)
ollama run gemma4:4b

# Run Gemma 4 12B (multimodal)
ollama run gemma4:12b

# Run with an image input (the image path goes inside the prompt)
ollama run gemma4:12b "Describe this image: ./photo.jpg"

# Run Gemma 4 27B
ollama run gemma4:27b
```

## Method 2: Hugging Face Transformers (Python)
For fine-tuning or custom pipelines, use the transformers library directly:
```shell
pip install transformers accelerate torch
```
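Before loading the model, it helps to check what hardware PyTorch can see and pick a dtype accordingly. A minimal sketch (bf16 on CUDA, float32 on CPU is a common default, not a Gemma-specific requirement):

```python
import torch

def pick_device_and_dtype():
    """Choose a device/dtype pair for inference based on available hardware."""
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"CUDA GPU with {vram_gb:.1f} GB VRAM")
        return "cuda", torch.bfloat16
    print("No CUDA GPU found; falling back to CPU")
    return "cpu", torch.float32

device, dtype = pick_device_and_dtype()
```

Compare the reported VRAM against the requirements table before choosing a model size.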
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # auto-distributes across available GPUs
)

messages = [{"role": "user", "content": "Explain neural networks"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
    return_tensors="pt", return_dict=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Using Audio and Image Inputs (12B / 27B)
```python
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from PIL import Image
import torch

model_id = "google/gemma-4-12b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma4ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Image input: the image is embedded directly in the message content
image = Image.open("chart.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What does this chart show?"}
    ]
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

## Cloud Option: When Local Isn't Enough
In bf16, the 27B model's weights alone take roughly 54 GB of VRAM (27 billion parameters × 2 bytes), so it won't fit on any single consumer GPU. An A100 80GB at $1.89/hr on Lambda Labs leaves enough headroom for batch inference jobs. For development use, an RTX 4090 at $0.74/hr on RunPod can run the 27B at Q4 quantization.
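To sanity-check that trade-off: Q4 stores each weight in roughly half a byte, and the hourly rates above translate into a per-job cost you can estimate up front. A back-of-the-envelope sketch (the 8-hour job length is an illustrative assumption):

```python
def q4_weight_gb(params_billion: float) -> float:
    """Q4 is ~4 bits per parameter, i.e. half a byte each (weights only)."""
    return params_billion * 0.5

def job_cost(hourly_rate: float, hours: float) -> float:
    """Total rental cost for a job of the given duration."""
    return hourly_rate * hours

print(f"27B @ Q4 weights: ~{q4_weight_gb(27):.1f} GB (fits a 24 GB RTX 4090)")
print(f"8-hour batch job on A100 80GB:  ${job_cost(1.89, 8):.2f}")
print(f"8-hour dev session on RTX 4090: ${job_cost(0.74, 8):.2f}")
```

At these rates, the 4090 is the cheaper choice whenever Q4 quality is acceptable; reserve the A100 for jobs that need the full bf16 weights or large batch sizes.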