
How to Run Gemma 4 Locally (Text, Audio, Image)

Run Google Gemma 4 locally with Ollama and Python transformers. Multimodal image input examples included.

April 10, 2026 · 7 min read
Gemma 4 Variants

Gemma 4 4B: 4 GB VRAM, text only, laptop-friendly
Gemma 4 12B: 10 GB VRAM, text + image + audio, RTX 3080 class
Gemma 4 27B: 20 GB VRAM, all modalities, RTX 3090 / 4090

Gemma 4 is Google's fourth-generation open model family. Unlike Gemma 3, Gemma 4 supports text, images, and audio in the 12B and 27B variants. The 4B text-only version runs on any modern laptop. All weights are available on Hugging Face under a permissive license.

Requirements

Model         Min VRAM   Min RAM   Recommended GPU
Gemma 4 4B    4 GB       8 GB      GTX 1650 / CPU
Gemma 4 12B   10 GB      16 GB     RTX 3080 Ti / 4070
Gemma 4 27B   20 GB      32 GB     RTX 3090 / 4090

Method 1: Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Gemma 4 4B (text)
ollama run gemma4:4b

# Run Gemma 4 12B (multimodal)
ollama run gemma4:12b

# Run with an image input (the file path goes inside the prompt string)
ollama run gemma4:12b "Describe this image: ./photo.jpg"

# Run Gemma 4 27B
ollama run gemma4:27b
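Beyond the CLI, Ollama serves a local REST API on port 11434, which is handier for scripting. A minimal sketch using only the standard library; it assumes the server is running and that the `gemma4:4b` tag from above has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False returns one JSON object instead of newline-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (requires a running server):
# generate("gemma4:4b", "Explain neural networks in one sentence")
```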

Method 2: Hugging Face Transformers (Python)

For fine-tuning or custom pipelines, use the transformers library directly:

pip install transformers accelerate torch

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # auto-distributes across available GPUs
)

messages = [{"role": "user", "content": "Explain neural networks"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
    return_tensors="pt", return_dict=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Slice off the prompt tokens so only the new reply is printed
reply = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(reply, skip_special_tokens=True))
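For multi-turn chat, you append the decoded reply back onto `messages` before the next `apply_chat_template` call. A hypothetical helper for that bookkeeping (pure Python, the function name is my own):

```python
def append_turn(messages: list[dict], role: str, content: str) -> list[dict]:
    """Return a new chat history with one more turn appended."""
    return messages + [{"role": role, "content": content}]

# Typical loop: decode the model's reply (with the prompt tokens stripped),
# append it as the assistant turn, then add the next user turn.
history = [{"role": "user", "content": "Explain neural networks"}]
history = append_turn(history, "assistant", "A neural network is ...")
history = append_turn(history, "user", "Give an example application")
```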

Using Audio and Image Inputs (12B / 27B)

from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from PIL import Image
import torch

model_id = "google/gemma-4-12b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma4ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Image input
image = Image.open("chart.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this chart show?"}
    ]
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt", images=[image]
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
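The snippet above covers images only. Assuming the processor handles audio through the same content-type pattern as images (an assumption worth checking against the model card), an audio turn might be built like this; the helper name is my own:

```python
# Assumption: audio is referenced with a {"type": "audio"} content part,
# mirroring the {"type": "image"} placeholder used in the image example.
def audio_message(question: str) -> list[dict]:
    """Build a single-turn multimodal message with one audio attachment."""
    return [{
        "role": "user",
        "content": [
            {"type": "audio"},
            {"type": "text", "text": question},
        ],
    }]

messages = audio_message("Transcribe this clip")
# Hypothetically, the audio file would then be passed to the processor
# alongside the messages, the same way images=[image] is passed above.
```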

Cloud Option: When Local Isn't Enough

The 27B model at full precision (bf16) needs roughly 54 GB of VRAM, more than any single consumer card offers. An A100 80GB at $1.89/hr on Lambda Labs leaves enough headroom for batch inference jobs. For development, an RTX 4090 at $0.74/hr on RunPod runs the 27B at Q4 quantization.
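The 54 GB figure follows from a standard rule of thumb: weights take about 2 bytes per parameter at bf16/fp16, 1 byte at int8, and roughly half a byte at Q4, with the KV cache and activations on top. A quick estimator (names my own):

```python
# Approximate bytes per parameter for common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}

def weight_vram_gb(n_params_billion: float, dtype: str = "bf16") -> float:
    """Approximate VRAM for the weights alone; KV cache etc. is extra."""
    return n_params_billion * BYTES_PER_PARAM[dtype]

print(weight_vram_gb(27))         # 54.0 -> the full-precision figure above
print(weight_vram_gb(27, "q4"))   # 13.5 -> why Q4 fits a 24 GB RTX 4090
```

The same arithmetic explains the requirements table: the 12B at Q4 plus overhead lands near the 10 GB minimum listed for it.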



