Reliability-aware pruning for scalable on-policy distillation.
We recommend a CUDA 12 machine with 8 NVIDIA GPUs. The training code is based on verl.
# 1. Clone and enter the repository.
git clone <repo-url> prune-opd
cd prune-opd
# 2. Create the environment.
conda create -n opd python=3.12 -y
conda activate opd
# 3. Install verl runtime dependencies.
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
cd ..
# 4. Install local packages.
pip install -e ./verl
pip install math-verify

If the FlashAttention wheel selected by verl/scripts/install_vllm_sglang_mcore.sh does not match your CUDA/PyTorch platform, install the matching wheel manually and rerun the remaining pip installs.
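To pick a matching wheel you need to know what the active environment provides. Prebuilt FlashAttention wheel filenames typically encode the CUDA major version, the PyTorch version, the C++11 ABI flag, and the CPython tag. The sketch below prints those values. The helper name `print_platform` is hypothetical, and the snippet assumes `python` (or `python3`) on your `PATH` is the interpreter of the `opd` environment.

```shell
# Hypothetical helper: print the fields a FlashAttention wheel name usually encodes.
print_platform() {
  # Assumption: the first python on PATH belongs to the active conda env.
  py=$(command -v python || command -v python3 || true)
  if [ -z "$py" ]; then
    echo "python: not found"
    return 0
  fi
  "$py" - <<'EOF'
import sys

# CPython tag, e.g. cp312 for Python 3.12.
print("python:", "cp%d%d" % sys.version_info[:2])
try:
    import torch
except ImportError:
    print("torch: not installed")
else:
    print("torch:", torch.__version__)
    print("cuda:", torch.version.cuda)
    print("cxx11_abi:", torch.compiled_with_cxx11_abi())
EOF
}

print_platform
```

Compare the printed values against the wheel filename before installing it manually.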
Set the data and model roots before running scripts:
export DATA_ROOT=/path/to/datasets
export MODEL_ROOT=/path/to/models

Expected data layout:
${DATA_ROOT}/dapo-math-17k.parquet
${DATA_ROOT}/test_data/AMC23/test.parquet
${DATA_ROOT}/test_data/AIME24/test.parquet
${DATA_ROOT}/test_data/AIME25/test.parquet
${DATA_ROOT}/test_data/HMMT24/test.parquet
${DATA_ROOT}/test_data/HMMT25/test.parquet
Expected model directories:
${MODEL_ROOT}/DeepSeek-R1-Distill-Qwen-1.5B
${MODEL_ROOT}/JustRL-DeepSeek-1.5B
${MODEL_ROOT}/Qwen3-4B-Base
${MODEL_ROOT}/Qwen3-4B
You may also set ACTOR_MODEL_PATH and REWARD_MODEL_PATH directly.
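Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: `missing_paths` is a hypothetical helper that prints every expected file or directory that is absent under `DATA_ROOT` and `MODEL_ROOT`.

```shell
# Hypothetical helper: list expected paths that do not exist yet.
missing_paths() {
  # Fall back to the placeholder roots when the variables are unset.
  data_root=${DATA_ROOT:-/path/to/datasets}
  model_root=${MODEL_ROOT:-/path/to/models}

  # Training set plus the five evaluation sets.
  for f in dapo-math-17k.parquet \
           test_data/AMC23/test.parquet \
           test_data/AIME24/test.parquet \
           test_data/AIME25/test.parquet \
           test_data/HMMT24/test.parquet \
           test_data/HMMT25/test.parquet; do
    [ -f "${data_root}/${f}" ] || echo "${data_root}/${f}"
  done

  # Model directories for both teacher-student pairs.
  for m in DeepSeek-R1-Distill-Qwen-1.5B JustRL-DeepSeek-1.5B \
           Qwen3-4B-Base Qwen3-4B; do
    [ -d "${model_root}/${m}" ] || echo "${model_root}/${m}"
  done
}

missing_paths
```

An empty output means everything was found. A missing model directory only matters for the pair you intend to run, and not at all if you point `ACTOR_MODEL_PATH` and `REWARD_MODEL_PATH` elsewhere.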
We provide OPD and Prune-OPD scripts for two teacher-student pairs.
| Script | Description |
|---|---|
| experiments_scripts/opd-baseline-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh | OPD baseline for DeepSeek-R1-Distill-Qwen-1.5B / JustRL-DeepSeek-1.5B |
| experiments_scripts/prune-opd-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh | Prune-OPD for DeepSeek-R1-Distill-Qwen-1.5B / JustRL-DeepSeek-1.5B |
| experiments_scripts/opd-baseline-qwen3-4b-base-qwen3-4b-non-thinking.sh | OPD baseline for Qwen3-4B-Base / Qwen3-4B (Non-thinking) |
| experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh | Prune-OPD for Qwen3-4B-Base / Qwen3-4B (Non-thinking) |
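The four script names follow a `<method>-<pair>.sh` pattern, so they can be enumerated in a loop. The sketch below is an illustration only: `list_scripts` is a hypothetical helper, and the loop assumes it is run from the repository root with `DATA_ROOT` and `MODEL_ROOT` exported. It uses the `DRY_RUN=1` preview mode of the scripts, so no training is launched.

```shell
# Hypothetical helper: build the four script paths from the naming pattern.
list_scripts() {
  for pair in deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b \
              qwen3-4b-base-qwen3-4b-non-thinking; do
    for method in opd-baseline prune-opd; do
      echo "experiments_scripts/${method}-${pair}.sh"
    done
  done
}

# Preview every script that exists; skip the rest instead of failing.
for script in $(list_scripts); do
  if [ -f "$script" ]; then
    DRY_RUN=1 bash "$script"
  else
    echo "skip (not found): $script"
  fi
done
```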
Preview the resolved command without launching training:
DRY_RUN=1 DATA_ROOT=/path/to/datasets MODEL_ROOT=/path/to/models \
bash experiments_scripts/prune-opd-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh

Run training:
DATA_ROOT=/path/to/datasets MODEL_ROOT=/path/to/models \
bash experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh

Extra Hydra overrides can be appended:
bash experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh \
trainer.test_freq=10 actor_rollout_ref.actor.optim.lr=5e-7

Main defaults:
- train data: DAPO-Math-17K
- evaluation: AMC23, AIME24, AIME25, HMMT24, HMMT25
- evaluation metric: Avg@16
- max response length: 12288
- validation max response length: 31744
- rollout number: 4
- mini-batch size: 64
- log-prob top-k: 16
- training steps: 203
- Prune-OPD metric: overlap ratio, threshold 0.7
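This README does not define Avg@16. It is usually read as accuracy averaged over 16 sampled responses per problem, then averaged over problems. The sketch below assumes that reading and a hypothetical input format with one line per sampled response, `<problem_id> <0|1>`. The repository's own evaluation code remains the reference for reported numbers.

```shell
# Hypothetical helper: mean per-problem accuracy over all sampled responses.
# Input on stdin: "<problem_id> <correct>" with correct being 0 or 1.
avg_at_k() {
  awk '{ sum[$1] += $2; n[$1]++ }
       END {
         # Average within each problem first, then across problems.
         for (p in sum) { total += sum[p] / n[p]; count++ }
         if (count > 0) printf "%.4f\n", total / count
       }'
}

# Toy example with k=2: p1 is solved once out of two samples, p2 twice.
printf '%s\n' "p1 1" "p1 0" "p2 1" "p2 1" | avg_at_k
```

With 16 lines per problem, the same function yields the assumed Avg@16 value.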
Logging is console-only by default. To enable W&B:
WANDB_API_KEY=<your-key> WANDB_MODE=online TRACKING_BACKENDS='[console,wandb]' \
bash experiments_scripts/opd-baseline-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh

If you find this repository useful, please cite our paper:
@misc{yang2026pruneopd,
title={Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning},
author={Zhicheng Yang and Zhijiang Guo and Yifan Song and Minrui Xu and Yongxin Wang and Yiwei Wang and Xiaodan Liang and Jing Tang},
year={2026},
eprint={2605.07804},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.07804}
}

This codebase builds on the excellent verl training framework and the THUNLP/OPD implementation for on-policy distillation.
