A comparison study of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for instruction-following language models using Mistral-7B.
This project trains and evaluates three model variants:
- Base Model - Mistral-7B-v0.1 (no fine-tuning)
- SFT Model - Base + Supervised Fine-Tuning on instruction data
- SFT+DPO Model - SFT model + Direct Preference Optimization on preference pairs

Supervised Fine-Tuning (SFT):
- Trains the model to follow instructions using (instruction, response) pairs
- Dataset: Alpaca-GPT4 (~10K examples)
- LoRA configuration: r=16, alpha=32
- Learning rate: 2e-5
- Epochs: 3
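
To make the LoRA settings above concrete, here is a minimal sketch of what r=16 and alpha=32 imply for a single weight matrix. It relies only on the standard LoRA formulation (a low-rank update scaled by alpha/r). The 4096 x 4096 matrix size is an assumption chosen for illustration; which layers `train_sft.py` actually adapts is not stated here.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two adapter matrices: B (d_out x r) and A (r x d_in)."""
    return r * (d_in + d_out)


R, ALPHA = 16, 32   # values from the SFT configuration above
HIDDEN = 4096       # assumption: size of one square projection matrix

scaling = lora_scaling(R, ALPHA)
adapter_params = lora_trainable_params(HIDDEN, HIDDEN, R)
full_params = HIDDEN * HIDDEN
fraction = adapter_params / full_params

print(f"scaling={scaling}, adapter params={adapter_params}, "
      f"fraction of full matrix={fraction:.4%}")
```

Under these assumptions the adapter trains under 1% of the parameters of the matrix it modifies, which is why LoRA fits on a single GPU.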

Direct Preference Optimization (DPO):
- Trains the model to prefer "chosen" responses over "rejected" ones
- Dataset: UltraFeedback preference pairs
- Starts from the SFT model checkpoint
- Learning rate: 5e-7 (much lower than SFT)
- Beta (KL penalty): 0.1
- Epochs: 1
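
The role of beta in the settings above can be illustrated with the standard DPO loss for a single preference pair. This is a plain-Python sketch of the published formula, not the code in `train_dpo.py`; the function name and the log-probability values are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference (SFT) model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
# At the start of DPO the policy equals the reference, so the margin is 0.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After some training the policy favors the chosen response more strongly.
loss_improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)

print(loss_at_start, loss_improved)
```

A larger beta penalizes drift from the reference model more heavily; with beta=0.1 the policy can move a fair distance from the SFT checkpoint before the loss saturates.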
.
├── scripts/
│ ├── train_sft.py # SFT training with LoRA
│ ├── train_dpo.py # DPO training on top of SFT
│ ├── evaluate.py # Evaluate all model variants
│ ├── upload_results.py # Upload models to Hugging Face Hub
│ └── run_full_pipeline.sh # End-to-end training pipeline
├── configs/ # Training configurations
├── results/ # Evaluation outputs
├── requirements.txt
└── README.md
- Python 3.10+
- CUDA-capable GPU (A100 recommended)
- ~40GB GPU memory for training
Install dependencies:

    pip install -r requirements.txt

Run the full pipeline:

    bash scripts/run_full_pipeline.sh

Train SFT model:

    python scripts/train_sft.py

Train DPO model (requires SFT model):

    python scripts/train_dpo.py

Evaluate all models:

    python scripts/evaluate.py

Upload to Hugging Face Hub:

    python scripts/upload_results.py

The evaluation script tests models on diverse prompts across categories:
- Instruction following
- Reasoning
- Creative writing
- Factual knowledge
- Safety (refusal behavior)
- Coding
Results are saved to:
- results/evaluation_results.json - Full JSON results
- results/side_by_side_comparison.txt - Human-readable comparison
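
As a rough illustration of how one evaluation record could feed both output files, the sketch below builds a per-prompt record and renders it as JSON and as a side-by-side text block. The record schema, the category keys, the model keys, and both helper functions are assumptions made for this example; the actual format written by `evaluate.py` may differ.

```python
import json

# Hypothetical keys derived from the category and model lists in this README.
CATEGORIES = ["instruction_following", "reasoning", "creative_writing",
              "factual_knowledge", "safety", "coding"]
MODELS = ["base", "sft", "sft_dpo"]


def build_record(category: str, prompt: str, responses: dict) -> dict:
    """Validate and assemble one evaluation record (assumed schema)."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    missing = [m for m in MODELS if m not in responses]
    if missing:
        raise ValueError(f"missing responses for: {missing}")
    return {"category": category, "prompt": prompt,
            "responses": {m: responses[m] for m in MODELS}}


def render_side_by_side(records: list) -> str:
    """Render records as plain text, one model response per line."""
    lines = []
    for rec in records:
        lines.append(f"[{rec['category']}] {rec['prompt']}")
        for model in MODELS:
            lines.append(f"  {model}: {rec['responses'][model]}")
        lines.append("")
    return "\n".join(lines)


# Placeholder strings stand in for real model outputs.
records = [build_record("coding", "Reverse a string in Python.",
                        {"base": "<base output>",
                         "sft": "<sft output>",
                         "sft_dpo": "<sft+dpo output>"})]

json_text = json.dumps(records, indent=2)
text_report = render_side_by_side(records)
print(text_report)
```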
Designed to run on Lambda Labs GPU instances. Estimated costs:
- Training time: ~10-12 hours
- Cost: ~$10-13 on a single A100
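
The estimate above can be sanity-checked with simple arithmetic. The sketch below derives the hourly rate implied by the quoted ranges and re-estimates cost for a different run length. Pairing the low cost with the low duration (and high with high) is an assumption, and current Lambda Labs pricing should be checked before relying on either number.

```python
def implied_hourly_rate(total_cost_usd: float, hours: float) -> float:
    """Hourly rate implied by a total cost and a run duration."""
    return total_cost_usd / hours


def estimate_cost(hours: float, hourly_rate_usd: float) -> float:
    """Total cost for a run of the given length at a fixed hourly rate."""
    return hours * hourly_rate_usd


# Ranges quoted above: ~10-12 hours, ~$10-13.
low_rate = implied_hourly_rate(10, 10)
high_rate = implied_hourly_rate(13, 12)

print(f"implied rate: ${low_rate:.2f}-${high_rate:.2f} per GPU-hour")
print(f"11-hour run at the low rate: ${estimate_cost(11, low_rate):.2f}")
```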