TinyZero

TinyZero is a reproduction of DeepSeek R1 Zero on countdown and multiplication tasks. It is built on veRL.

Through RL, the 3B base LM develops self-verification and search abilities all on its own.

You can experience the "aha" moment yourself for under $30.

Twitter thread: https://x.com/jiayi_pirate/status/1882839370505621655

Full experiment log: https://wandb.ai/jiayipan/TinyZero

Paper's on its way!

Update: Quick start (for those interested in running it on 2 H100s/A100s)

This is a condensed and runnable guide for setting up and running TinyZero on Hyperbolic using the specified environment.

1. Set up Hyperbolic H100 Environment

Sign up/sign in:
- W&B
- Hyperbolic

Start a machine:
- Instance: A100 SXM or H100 SXM, select 2 GPUs.
- Image: nvidia-cuda124-ubuntu2204

2. System Preparation

Run the following commands to update the system and install dependencies:

# Update and upgrade system
sudo apt update && sudo apt upgrade -y

# Install necessary packages (python3-venv is needed to create the virtual environment)
sudo apt install -y git python3 python3-pip python3-venv

# Create a virtual environment and activate it
python3 -m venv myenv
source myenv/bin/activate

3. Clone and Set Up TinyZero

# Clone the TinyZero repository
git clone https://github.com/JerryWu-code/TinyZero.git

# Navigate into the TinyZero directory
cd TinyZero

4. Install Dependencies

# Install PyTorch (or let vLLM handle the correct version)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# Install vLLM
pip3 install vllm==0.6.3  # Other versions: 0.5.4, 0.4.2, 0.3.1

# Install Ray
pip3 install ray

# Install TinyZero in editable mode
pip install -e .

# Install Flash Attention 2
pip3 install flash-attn --no-build-isolation

# Install quality-of-life tools
pip install wandb IPython matplotlib

5. Log in to W&B

wandb login

6. Download Dataset

huggingface-cli download Jiayi-Pan/Countdown-Tasks-3to4 \
  --local-dir ./data/countdown --repo-type dataset

7. Preprocess Dataset

python ./examples/data_preprocess/countdown.py --local_dir ./data/countdown

8. Download and Save Pretrained Model

python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = 'Qwen/Qwen2.5-3B'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for Flash Attention 2.0
)
# Save relative to the TinyZero directory so the path matches BASE_MODEL in step 9
model.save_pretrained('./models/Qwen2.5-3B')
tokenizer.save_pretrained('./models/Qwen2.5-3B')
"
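Before moving on, it can help to confirm that the save actually produced a usable model directory. The following is a minimal sketch, not part of TinyZero: the helper name `missing_model_files` is hypothetical, and the file names it looks for (`config.json`, `tokenizer_config.json`, and at least one `*.safetensors` or `*.bin` weight file) are an assumption about what `save_pretrained` typically writes.

```python
from pathlib import Path
from typing import List


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected entries that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    # Config files that save_pretrained is assumed to write for model and tokenizer.
    missing = [
        name
        for name in ("config.json", "tokenizer_config.json")
        if not (root / name).is_file()
    ]
    # Weights may be stored as safetensors or as legacy .bin files, possibly sharded.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        missing.append("weight files (*.safetensors or *.bin)")
    return missing


if __name__ == "__main__":
    problems = missing_model_files("./models/Qwen2.5-3B")
    print(problems or "model directory looks complete")
```

Run it from the TinyZero directory; an empty result only means the expected files exist, not that the weights are valid.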

9. Set Environment Variables

export N_GPUS=2
export BASE_MODEL="./models/Qwen2.5-3B"
export DATA_DIR="./data/countdown"
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS
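A typo in one of these variables usually surfaces only after the training job has started loading. The sketch below is one possible pre-flight check and is not part of the repository: `check_env` is a hypothetical helper, and the rule that `N_GPUS` should be divisible by `ROLLOUT_TP_SIZE` is an assumption about how tensor-parallel rollout groups are formed, not something stated above.

```python
import os
from typing import List, Mapping

# Variables exported in the step above that the check treats as required.
REQUIRED_VARS = ("N_GPUS", "BASE_MODEL", "DATA_DIR", "ROLLOUT_TP_SIZE", "EXPERIMENT_NAME")


def check_env(env: Mapping[str, str], check_paths: bool = True) -> List[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    if problems:
        return problems
    try:
        n_gpus = int(env["N_GPUS"])
        tp_size = int(env["ROLLOUT_TP_SIZE"])
    except ValueError:
        return ["N_GPUS and ROLLOUT_TP_SIZE must be integers"]
    if n_gpus < 1 or tp_size < 1:
        problems.append("N_GPUS and ROLLOUT_TP_SIZE must be at least 1")
    elif n_gpus % tp_size != 0:
        # Assumption: the tensor-parallel size should evenly divide the GPU count.
        problems.append("N_GPUS is not divisible by ROLLOUT_TP_SIZE")
    if check_paths:
        for name in ("BASE_MODEL", "DATA_DIR"):
            if not os.path.isdir(env[name]):
                problems.append(f"{name}={env[name]} is not an existing directory")
    return problems


if __name__ == "__main__":
    for problem in check_env(os.environ):
        print("WARNING:", problem)
```

Running the file in the same shell where the variables were exported prints one warning per problem and nothing when all checks pass.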

10. Update Training Script

Edit the training script to remove unnecessary prefixes from the Python path:

vim scripts/train_tiny_zero_a100_grpo.sh

Remove /home/weiji/anaconda3/envs/zero/bin/ from the first line so it directly references python3.
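If you would rather not edit the file by hand, the same change can be scripted. This is a minimal sketch under the assumption that the prefix appears verbatim on the first line of the script; the helper names `strip_prefix_from_first_line` and `patch_script` are hypothetical and not part of TinyZero.

```python
from pathlib import Path

# Interpreter directory named in the step above.
PREFIX = "/home/weiji/anaconda3/envs/zero/bin/"


def strip_prefix_from_first_line(text: str, prefix: str = PREFIX) -> str:
    """Remove the first occurrence of prefix from the first line only."""
    first, newline, rest = text.partition("\n")
    return first.replace(prefix, "", 1) + newline + rest


def patch_script(path: str) -> bool:
    """Rewrite the script in place; return True if anything changed."""
    script = Path(path)
    original = script.read_text()
    patched = strip_prefix_from_first_line(original)
    if patched != original:
        script.write_text(patched)
    return patched != original


if __name__ == "__main__":
    target = Path("scripts/train_tiny_zero_a100_grpo.sh")
    if target.is_file():
        print("patched" if patch_script(str(target)) else "nothing to change")
    else:
        print(f"{target} not found; run this from the TinyZero directory")
```

Only the first line is touched, so any other paths in the script are left as they are.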

11. Make Script Executable

chmod +x ./scripts/train_tiny_zero_a100_grpo.sh

12. Start Training

./scripts/train_tiny_zero_a100_grpo.sh

Original Installation

conda create -n zero python=3.9
conda activate zero
# install torch [or you can skip this step and let vllm install the correct version for you]
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip3 install vllm==0.6.3 # or you can install 0.5.4, 0.4.2 and 0.3.1
pip3 install ray

# verl
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation
# quality of life
pip install wandb IPython matplotlib

Countdown task

Data Preparation

conda activate zero
python ./examples/data_preprocess/countdown.py --local_dir {path_to_your_dataset}
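For readers new to the task, the sketch below shows what a correct Countdown answer looks like. It assumes the usual formulation: combine the given numbers, each used exactly once, with +, -, * and / to reach a target. The function `is_valid_solution` is a hypothetical illustration and is not the reward function used in this repository.

```python
import ast
import operator
from collections import Counter
from typing import List

# Only the four basic arithmetic operators are accepted.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST, used: List[int]) -> float:
    """Evaluate a restricted arithmetic tree and record every integer literal seen."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return float(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def is_valid_solution(expression: str, numbers: List[int], target: int) -> bool:
    """True if expression uses each given number exactly once and equals target."""
    used: List[int] = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and abs(value - target) < 1e-6


if __name__ == "__main__":
    print(is_valid_solution("(25 - 5) * 3", [3, 5, 25], 60))
    print(is_valid_solution("25 * 3 - 5", [3, 5, 25], 60))
```

Parsing with `ast` instead of calling `eval` keeps the check limited to plain arithmetic, which matters when the expression comes from model output.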

Run Training

conda activate zero

If the following commands run out of VRAM, try adding critic.model.enable_gradient_checkpointing=True to the script, and see the discussion linked in the repository README.

Single GPU

Works for models <= 1.5B parameters. The Qwen2.5-0.5B base model is known to fail to learn reasoning.

export N_GPUS=1
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=1
export EXPERIMENT_NAME=countdown-qwen2.5-0.5b
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh

3B+ model

In this case, the base model is able to develop sophisticated reasoning skills.

export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh

Instruct Ablation

We also experiment with Qwen2.5-3B-Instruct.

Data Preparation

To follow the chat template, we need to reprocess the data:

conda activate zero
python examples/data_preprocess/countdown.py --template_type=qwen-instruct --local_dir={path_to_your_dataset}

Training

export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b-instruct
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh

Acknowledgements

- We run our experiments on veRL.
- We use base models from the Qwen2.5 series.

Citation

@misc{tinyzero,
author       = {Jiayi Pan and Junjie Zhang and Xingyao Wang and Lifan Yuan and Hao Peng and Alane Suhr},
title        = {TinyZero},
howpublished = {https://github.com/Jiayi-Pan/TinyZero},
note         = {Accessed: 2025-01-24},
year         = {2025}
}

