
How to Run Gemma 4 Locally (Text, Audio, Image)

Run Google Gemma 4 locally with Ollama and Python transformers. Multimodal image input examples included.

April 10, 2026 · 7 min read
Gemma 4 Variants

Gemma 4 4B: 4 GB VRAM, text only, laptop-friendly
Gemma 4 12B: 10 GB VRAM, text + image + audio, RTX 3080 class
Gemma 4 27B: 20 GB VRAM, all modalities, RTX 3090 / 4090

Gemma 4 is Google's fourth-generation open model family. Unlike Gemma 3, Gemma 4 supports text, images, and audio in the 12B and 27B variants. The 4B text-only version runs on any modern laptop. All weights are available on Hugging Face under a permissive license.

Requirements

Model         Min VRAM   Min RAM   Recommended GPU
Gemma 4 4B    4 GB       8 GB      GTX 1650 / CPU
Gemma 4 12B   10 GB      16 GB     RTX 3080 Ti / 4070
Gemma 4 27B   20 GB      32 GB     RTX 3090 / 4090

Method 1: Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Gemma 4 4B (text)
ollama run gemma4:4b

# Run Gemma 4 12B (multimodal)
ollama run gemma4:12b

# Run with an image input (the file path goes inside the prompt string)
ollama run gemma4:12b "Describe this image: ./photo.jpg"

# Run Gemma 4 27B
ollama run gemma4:27b
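Beyond the CLI, Ollama serves a local REST API on port 11434, which is handier for scripting. A minimal sketch using only the standard library; it assumes the server is running and that the `gemma4:4b` tag from above has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False returns one JSON object instead of newline-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (requires a running server):
# generate("gemma4:4b", "Explain neural networks in one sentence")
```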

Method 2: Hugging Face Transformers (Python)

For fine-tuning or custom pipelines, use the transformers library directly:

pip install transformers accelerate torch

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # auto-distributes across available GPUs
)

messages = [{"role": "user", "content": "Explain neural networks"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
    return_tensors="pt", return_dict=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Slice off the prompt tokens so only the new reply is printed
reply = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(reply, skip_special_tokens=True))
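For multi-turn chat, you append the decoded reply back onto `messages` before the next `apply_chat_template` call. A hypothetical helper for that bookkeeping (pure Python, the function name is my own):

```python
def append_turn(messages: list[dict], role: str, content: str) -> list[dict]:
    """Return a new chat history with one more turn appended."""
    return messages + [{"role": role, "content": content}]

# Typical loop: decode the model's reply (with the prompt tokens stripped),
# append it as the assistant turn, then add the next user turn.
history = [{"role": "user", "content": "Explain neural networks"}]
history = append_turn(history, "assistant", "A neural network is ...")
history = append_turn(history, "user", "Give an example application")
```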

Using Audio and Image Inputs (12B / 27B)

from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from PIL import Image
import torch

model_id = "google/gemma-4-12b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma4ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Image input
image = Image.open("chart.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this chart show?"}
    ]
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt", images=[image]
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
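The snippet above covers images only. Assuming the processor handles audio through the same content-type pattern as images (an assumption worth checking against the model card), an audio turn might be built like this; the helper name is my own:

```python
# Assumption: audio is referenced with a {"type": "audio"} content part,
# mirroring the {"type": "image"} placeholder used in the image example.
def audio_message(question: str) -> list[dict]:
    """Build a single-turn multimodal message with one audio attachment."""
    return [{
        "role": "user",
        "content": [
            {"type": "audio"},
            {"type": "text", "text": question},
        ],
    }]

messages = audio_message("Transcribe this clip")
# Hypothetically, the audio file would then be passed to the processor
# alongside the messages, the same way images=[image] is passed above.
```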

Cloud Option: When Local Isn't Enough

The 27B model at full precision (bf16) needs roughly 54 GB of VRAM, more than any single consumer card offers. An A100 80GB at $1.89/hr on Lambda Labs leaves enough headroom for batch inference jobs. For development, an RTX 4090 at $0.74/hr on RunPod runs the 27B at Q4 quantization.
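The 54 GB figure follows from a standard rule of thumb: weights take about 2 bytes per parameter at bf16/fp16, 1 byte at int8, and roughly half a byte at Q4, with the KV cache and activations on top. A quick estimator (names my own):

```python
# Approximate bytes per parameter for common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}

def weight_vram_gb(n_params_billion: float, dtype: str = "bf16") -> float:
    """Approximate VRAM for the weights alone; KV cache etc. is extra."""
    return n_params_billion * BYTES_PER_PARAM[dtype]

print(weight_vram_gb(27))         # 54.0 -> the full-precision figure above
print(weight_vram_gb(27, "q4"))   # 13.5 -> why Q4 fits a 24 GB RTX 4090
```

The same arithmetic explains the requirements table: the 12B at Q4 plus overhead lands near the 10 GB minimum listed for it.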



