
How to Fine-Tune Llama on a Cloud GPU (Step by Step)

Full QLoRA fine-tuning pipeline with Axolotl: data formatting, training on cloud GPU, and GGUF export.

April 10, 2026 · 12 min read
Fine-Tuning Cost Estimate

- Llama 3.1 8B QLoRA (1k examples, 3 epochs): ~$1–3
- Llama 3.1 70B QLoRA (1k examples, 3 epochs): ~$8–15
- Recommended GPU: A100 80GB (or H100 for 70B)
- Training time (8B): 30–60 min on A100

Fine-tuning Llama gives you a model that follows your domain-specific format, style, or task. With QLoRA (Quantized Low-Rank Adaptation), you can fine-tune Llama 3.1 8B on a single A100 for under $5. This guide uses Axolotl, a popular open-source fine-tuning toolkit that handles QLoRA out of the box.

Step 1: Provision a Cloud GPU

For Llama 3.1 8B fine-tuning, an A100 40GB (~$1.19/hr on RunPod at the time of writing) is the minimum. For 70B, use an A100 80GB (~$1.89/hr on Lambda). SSH into your instance and verify CUDA:

# Verify GPU and CUDA
nvidia-smi
nvcc --version

# Update and install basics
apt-get update && apt-get install -y git python3-pip
pip install -U pip
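Before committing to a GPU tier, a rough back-of-envelope VRAM check helps. The sketch below is a heuristic, not a measurement: the 0.5 bytes/param figure reflects 4-bit quantized weights, while the overhead and activation constants are assumptions calibrated to match the guide's recommendations (8B fits a 40GB card, 70B wants 80GB).

```python
def qlora_vram_estimate_gb(params_billion: float,
                           micro_batch: int = 2,
                           seq_len: int = 4096) -> float:
    """Rough VRAM estimate for QLoRA fine-tuning with gradient checkpointing.

    Heuristic only: 4-bit base weights (~0.5 bytes/param), a flat allowance
    for LoRA adapters, paged optimizer state, and CUDA context, plus a crude
    linear proxy for activation memory. Always verify with nvidia-smi.
    """
    weights_gb = params_billion * 0.5                      # 4-bit quantized weights
    overhead_gb = 4.0                                      # adapters, optimizer, CUDA context
    activations_gb = 0.25 * params_billion * (micro_batch / 2) * (seq_len / 4096)
    return weights_gb + overhead_gb + activations_gb

print(round(qlora_vram_estimate_gb(8), 1))    # comfortably under 40 GB
print(round(qlora_vram_estimate_gb(70), 1))   # over 40 GB — use the 80 GB card
```

If your estimate lands near a card's limit, drop `micro_batch_size` or `sequence_len` in the config rather than renting a bigger GPU.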

Step 2: Format Your Training Data

Axolotl supports multiple formats. The simplest is Alpaca-style JSON:

# data.jsonl — one example per line
{"instruction": "Summarize this support ticket", "input": "Customer reports login fails on mobile app after update 3.2.1", "output": "Login regression introduced in 3.2.1 on mobile. Priority: high. Assign to mobile team."}
{"instruction": "Summarize this support ticket", "input": "User asks how to export data to CSV", "output": "Feature request for CSV export. Priority: low. Add to backlog."}

# Minimum viable dataset: 200+ examples
# Recommended: 1,000–5,000 high-quality examples
# More is not always better — quality > quantity
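A malformed line in data.jsonl can fail the run after you have already paid for GPU time, so it is worth validating locally first. A minimal sketch, assuming Alpaca-style records where `instruction` and `output` are required and `input` is optional:

```python
import json

REQUIRED_KEYS = {"instruction", "output"}  # "input" is optional in Alpaca format

def validate_alpaca_jsonl(lines):
    """Yield parsed records, raising ValueError on bad JSON or missing keys."""
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"line {lineno}: invalid JSON ({e})") from e
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        yield record

# Usage:
#   with open("data.jsonl", encoding="utf-8") as f:
#       print(len(list(validate_alpaca_jsonl(f))), "valid examples")
```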

Step 3: Install Axolotl

git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl

pip install packaging ninja
pip install -e '.[flash-attn,deepspeed]'

# Log in to Hugging Face (to download Llama weights)
huggingface-cli login
# Paste your HF token from huggingface.co/settings/tokens

Step 4: Create Training Config

# config.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
strict: false

datasets:
  - path: data.jsonl
    type: alpaca

dataset_prepared_path: ./prepared_data
val_set_size: 0.05
output_dir: ./output

sequence_len: 4096
sample_packing: true

adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

bf16: true
tf32: true
gradient_checkpointing: true
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
logging_steps: 10
save_steps: 100
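It helps to know what this config works out to in optimizer steps. The effective batch size is `micro_batch_size × gradient_accumulation_steps` = 8 per GPU, and `val_set_size: 0.05` holds out 5% of your data for evaluation. The arithmetic for a 1,000-example dataset:

```python
import math

# Values from config.yaml above
micro_batch_size = 2
gradient_accumulation_steps = 4
val_set_size = 0.05
num_epochs = 3
num_examples = 1000  # example dataset size

effective_batch = micro_batch_size * gradient_accumulation_steps
train_examples = int(num_examples * (1 - val_set_size))
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(effective_batch)   # 8
print(steps_per_epoch)   # 119 — sample_packing will reduce this further
print(total_steps)       # 357
```

Note that `sample_packing: true` concatenates short examples into full 4096-token sequences, so the real step count is usually lower than this estimate.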

Step 5: Run Training

# Start fine-tuning
accelerate launch -m axolotl.cli.train config.yaml

# Monitor GPU usage in another terminal
watch -n 1 nvidia-smi

# Training for 1k examples takes ~30-60 min on A100
# Checkpoint saved every 100 steps in ./output/
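Axolotl trains through the Hugging Face Trainer, so each checkpoint directory (e.g. `./output/checkpoint-100/`) contains a `trainer_state.json` whose `log_history` records loss every `logging_steps`. A small sketch for pulling out the loss curve; the inline dict mimics that file's structure:

```python
import json

def loss_curve(trainer_state: dict) -> list:
    """Extract (step, train_loss) pairs from a HF Trainer state dict.

    Eval entries carry "eval_loss" instead of "loss", so they are skipped.
    """
    return [(entry["step"], entry["loss"])
            for entry in trainer_state["log_history"]
            if "loss" in entry]

# In practice:
#   state = json.load(open("./output/checkpoint-100/trainer_state.json"))
state = {"log_history": [
    {"step": 10, "loss": 1.92, "learning_rate": 2e-4},
    {"step": 20, "loss": 1.41, "learning_rate": 1.9e-4},
    {"step": 20, "eval_loss": 1.38},  # eval entry — no train "loss" key
]}
for step, loss in loss_curve(state):
    print(step, loss)
```

If the loss plateaus early, your learning rate or dataset quality is the first thing to revisit; if it keeps dropping at epoch 3, a fourth epoch may still help.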

Step 6: Merge Adapter and Export

# Merge QLoRA adapter into base model
python -m axolotl.cli.merge_lora config.yaml

# Convert to GGUF for local inference with Ollama/llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py ../output/merged --outfile my-model.gguf

# Create Ollama model from GGUF
cat > Modelfile << 'EOF'
FROM ./my-model.gguf
SYSTEM "You are a helpful customer support agent."
EOF
ollama create my-fine-tuned-model -f Modelfile
ollama run my-fine-tuned-model
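One gotcha: the `alpaca` dataset type trained your model on a specific prompt template, and it will respond best when inference requests use the same wrapping. The template below is the standard Alpaca one — an assumption on my part; verify the exact wording your Axolotl version rendered by inspecting `./prepared_data`:

```python
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def alpaca_prompt(instruction: str, input_text: str = "") -> str:
    """Wrap a request in the same template the model saw during training."""
    if input_text:
        return ALPACA_WITH_INPUT.format(instruction=instruction, input=input_text)
    return ALPACA_NO_INPUT.format(instruction=instruction)

print(alpaca_prompt("Summarize this support ticket",
                    "User asks how to export data to CSV"))
```

For Ollama you can bake this in instead, via a `TEMPLATE` directive in the Modelfile, so callers don't need to format prompts themselves.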

Cost Summary

At $1.19/hr for an A100 40GB on RunPod, a 1-hour Llama 3.1 8B fine-tune costs about $1.20. A 70B run on an A100 80GB ($1.89/hr) for 4 hours is about $7.56. Always enable gradient checkpointing to reduce VRAM use, and use spot instances when your run saves checkpoints you can resume from.
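The arithmetic generalizes to any rate and runtime. A tiny estimator, using the article's example rates (the 50% spot discount is a hypothetical figure — actual spot pricing varies by provider and demand):

```python
def training_cost(hourly_rate: float, hours: float,
                  spot_discount: float = 0.0) -> float:
    """On-demand GPU cost in dollars, optionally discounted for spot pricing."""
    return round(hourly_rate * hours * (1 - spot_discount), 2)

print(training_cost(1.19, 1.0))       # 8B run, ~1 hour on A100 40GB  -> 1.19
print(training_cost(1.89, 4.0))       # 70B run, ~4 hours on A100 80GB -> 7.56
print(training_cost(1.89, 4.0, 0.5))  # same run at a hypothetical 50% spot rate -> 3.78
```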
