Fine-tuning Llama gives you a model that follows your domain-specific format, style, or task. With QLoRA (Quantized Low-Rank Adaptation), you can fine-tune Llama 3.1 8B on a single A100 for under $5. This guide uses Axolotl, a fine-tuning toolkit that handles QLoRA configuration out of the box.
Step 1: Provision a Cloud GPU
For Llama 3.1 8B fine-tuning: an A100 40GB ($1.19/hr on RunPod) is the minimum. For 70B: use A100 80GB ($1.89/hr on Lambda). SSH into your instance and verify CUDA:
# Verify GPU and CUDA
nvidia-smi
nvcc --version
# Update and install basics
apt-get update && apt-get install -y git python3-pip
pip install -U pip
Step 2: Format Your Training Data
Axolotl supports multiple formats. The simplest is Alpaca-style JSON:
# data.jsonl — one example per line
{"instruction": "Summarize this support ticket", "input": "Customer reports login fails on mobile app after update 3.2.1", "output": "Login regression introduced in 3.2.1 on mobile. Priority: high. Assign to mobile team."}
{"instruction": "Summarize this support ticket", "input": "User asks how to export data to CSV", "output": "Feature request for CSV export. Priority: low. Add to backlog."}
# Minimum viable dataset: 200+ examples
# Recommended: 1,000–5,000 high-quality examples
# More is not always better: quality > quantity
Step 3: Install Axolotl
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
pip install packaging ninja
pip install -e '.[flash-attn,deepspeed]'
# Log in to Hugging Face (to download Llama weights)
huggingface-cli login
# Paste your HF token from huggingface.co/settings/tokens
Step 4: Create Training Config
# config.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true        # quantize base weights to 4-bit (the "Q" in QLoRA)
strict: false
datasets:
  - path: data.jsonl
    type: alpaca
dataset_prepared_path: ./prepared_data
val_set_size: 0.05        # hold out 5% for validation
output_dir: ./output
sequence_len: 4096
sample_packing: true      # pack short examples into full sequences
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
bf16: true
tf32: true
gradient_checkpointing: true
micro_batch_size: 2
gradient_accumulation_steps: 4   # effective batch size = 2 * 4 = 8
num_epochs: 3
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false    # mask the prompt; compute loss only on responses
logging_steps: 10
save_steps: 100
Step 5: Run Training
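Before launching, it helps to sanity-check how long the run will be. With the config above, the effective batch size is micro_batch_size × gradient_accumulation_steps = 2 × 4 = 8; a rough optimizer-step count (ignoring sample packing, which reduces it further) can be sketched as:

```python
def estimate_steps(num_examples, micro_batch_size=2, grad_accum=4, epochs=3, val_fraction=0.05):
    """Rough optimizer-step count for the config above (sample packing ignored)."""
    effective_batch = micro_batch_size * grad_accum          # 2 * 4 = 8
    train_examples = int(num_examples * (1 - val_fraction))  # 5% held out for validation
    steps_per_epoch = -(-train_examples // effective_batch)  # ceiling division
    return steps_per_epoch * epochs

print(estimate_steps(1000))  # 357 steps for a 1k-example dataset
```

At save_steps: 100, that 1k-example run produces three intermediate checkpoints plus the final one.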
# Start fine-tuning
accelerate launch -m axolotl.cli.train config.yaml
# Monitor GPU usage in another terminal
watch -n 1 nvidia-smi
# Training for 1k examples takes ~30-60 min on A100
# Checkpoint saved every 100 steps in ./output/
Step 6: Merge Adapter and Export
# Merge QLoRA adapter into base model
python -m axolotl.cli.merge_lora config.yaml
# Convert to GGUF for local inference with Ollama/llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py ../output/merged --outfile my-model.gguf
# Create Ollama model from GGUF
cat > Modelfile << 'EOF'
FROM ./my-model.gguf
SYSTEM "You are a helpful customer support agent."
EOF
ollama create my-fine-tuned-model -f Modelfile
ollama run my-fine-tuned-model
Cost Summary
At $1.19/hr for an A100 40GB on RunPod, a one-hour Llama 3.1 8B fine-tune costs about $1.19. A four-hour 70B run on an A100 80GB ($1.89/hr) is about $7.56. Always enable gradient checkpointing to reduce VRAM use, and consider spot instances: since training saves a checkpoint every 100 steps, a preempted run can resume where it left off.
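The arithmetic above, with a small buffer added for setup, merging, and export time (the 0.25 h overhead is an assumption, and the quoted hourly rates will drift), can be sketched as:

```python
def training_cost(hourly_rate, hours, overhead_hours=0.25):
    """Estimated cloud bill: training time plus a buffer for setup/merge/export."""
    return round(hourly_rate * (hours + overhead_hours), 2)

print(training_cost(1.19, 1.0))  # 8B run:  ~$1.49 including overhead
print(training_cost(1.89, 4.0))  # 70B run: ~$8.03 including overhead
```

Even with the buffer, both runs stay well under typical managed fine-tuning API prices for comparable token volumes.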