yangzhch6/Prune-OPD

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Reliability-aware pruning for scalable on-policy distillation.

arXiv: https://arxiv.org/abs/2605.07804
[Figure: Prune-OPD overview]

Setup

We recommend a CUDA 12 machine with 8 NVIDIA GPUs. The training code is based on verl.

# 1. Clone and enter the repository.
git clone <repo-url> prune-opd
cd prune-opd

# 2. Create the environment.
conda create -n opd python=3.12 -y
conda activate opd

# 3. Install verl runtime dependencies.
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
cd ..

# 4. Install local packages.
pip install -e ./verl
pip install math-verify

If the FlashAttention wheel selected by verl/scripts/install_vllm_sglang_mcore.sh does not match your CUDA/PyTorch platform, install the matching wheel manually and rerun the remaining pip installs.
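To pick a matching wheel, it helps to know the Python tag, the PyTorch version, the CUDA version PyTorch was built against, and the C++11 ABI flag, since prebuilt FlashAttention wheels are typically tagged by these. The following is a minimal sketch; the function name describe_platform is hypothetical and not part of this repository.

```shell
# describe_platform: print the facts usually needed to choose a
# prebuilt FlashAttention wheel. Hypothetical helper, not from the repo.
describe_platform() {
  python3 - <<'PY'
import sys

# Python tag as it appears in wheel filenames, e.g. cp312.
print(f"python: cp{sys.version_info.major}{sys.version_info.minor}")
try:
    import torch
except ImportError:
    print("torch: not installed")
else:
    print(f"torch: {torch.__version__}")
    # CUDA version PyTorch was compiled against (None for CPU builds).
    print(f"cuda (torch build): {torch.version.cuda}")
    print(f"cxx11 abi: {torch.compiled_with_cxx11_abi()}")
PY
}
```

Run describe_platform inside the activated opd environment and compare its output against the tags in the wheel filename before installing.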

Data and Models

Set the data and model roots before running scripts:

export DATA_ROOT=/path/to/datasets
export MODEL_ROOT=/path/to/models

Expected data layout:

${DATA_ROOT}/dapo-math-17k.parquet
${DATA_ROOT}/test_data/AMC23/test.parquet
${DATA_ROOT}/test_data/AIME24/test.parquet
${DATA_ROOT}/test_data/AIME25/test.parquet
${DATA_ROOT}/test_data/HMMT24/test.parquet
${DATA_ROOT}/test_data/HMMT25/test.parquet

Expected model directories:

${MODEL_ROOT}/DeepSeek-R1-Distill-Qwen-1.5B
${MODEL_ROOT}/JustRL-DeepSeek-1.5B
${MODEL_ROOT}/Qwen3-4B-Base
${MODEL_ROOT}/Qwen3-4B

You may also set ACTOR_MODEL_PATH and REWARD_MODEL_PATH directly.
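Before launching a run, it can be useful to confirm that the layout above is in place. The following is a minimal sketch, assuming the default DATA_ROOT / MODEL_ROOT layout; the function name check_layout is hypothetical, and the model check does not apply if you set ACTOR_MODEL_PATH and REWARD_MODEL_PATH directly.

```shell
# check_layout DATA_ROOT MODEL_ROOT
# Hypothetical helper: prints each missing path to stderr and
# returns 1 if anything from the expected layout is absent.
check_layout() {
  local data_root="$1"
  local model_root="$2"
  local missing=0
  local rel

  # Dataset files listed under "Expected data layout".
  for rel in \
    dapo-math-17k.parquet \
    test_data/AMC23/test.parquet \
    test_data/AIME24/test.parquet \
    test_data/AIME25/test.parquet \
    test_data/HMMT24/test.parquet \
    test_data/HMMT25/test.parquet; do
    if [ ! -f "${data_root}/${rel}" ]; then
      echo "missing file: ${data_root}/${rel}" >&2
      missing=1
    fi
  done

  # Model directories listed under "Expected model directories".
  for rel in \
    DeepSeek-R1-Distill-Qwen-1.5B \
    JustRL-DeepSeek-1.5B \
    Qwen3-4B-Base \
    Qwen3-4B; do
    if [ ! -d "${model_root}/${rel}" ]; then
      echo "missing directory: ${model_root}/${rel}" >&2
      missing=1
    fi
  done

  return "${missing}"
}
```

For example, check_layout "${DATA_ROOT}" "${MODEL_ROOT}" && echo "layout OK" prints the confirmation only when every expected path exists.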

Training

We provide OPD and Prune-OPD scripts for two teacher-student pairs.

| Script | Description |
| --- | --- |
| experiments_scripts/opd-baseline-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh | OPD baseline for DeepSeek-R1-Distill-Qwen-1.5B / JustRL-DeepSeek-1.5B |
| experiments_scripts/prune-opd-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh | Prune-OPD for DeepSeek-R1-Distill-Qwen-1.5B / JustRL-DeepSeek-1.5B |
| experiments_scripts/opd-baseline-qwen3-4b-base-qwen3-4b-non-thinking.sh | OPD baseline for Qwen3-4B-Base / Qwen3-4B (Non-thinking) |
| experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh | Prune-OPD for Qwen3-4B-Base / Qwen3-4B (Non-thinking) |

Preview the resolved command without launching training:

DRY_RUN=1 DATA_ROOT=/path/to/datasets MODEL_ROOT=/path/to/models \
bash experiments_scripts/prune-opd-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh

Run training:

DATA_ROOT=/path/to/datasets MODEL_ROOT=/path/to/models \
bash experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh

Extra Hydra overrides can be appended:

bash experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh \
  trainer.test_freq=10 actor_rollout_ref.actor.optim.lr=5e-7
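Overrides combine naturally with the dry-run preview when comparing several settings. The sketch below assumes that DRY_RUN=1 and appended overrides can be used together, which the scripts suggest but which is not verified here; the function name preview_lr_sweep and the learning-rate values are hypothetical.

```shell
# preview_lr_sweep SCRIPT LR [LR ...]
# Hypothetical helper: runs SCRIPT once per learning rate with DRY_RUN=1,
# appending the learning-rate override as an extra argument.
preview_lr_sweep() {
  local script="$1"
  shift
  local lr
  for lr in "$@"; do
    echo "== lr=${lr} =="
    # DRY_RUN=1 applies to this invocation only; nothing is trained.
    DRY_RUN=1 bash "${script}" "actor_rollout_ref.actor.optim.lr=${lr}" || return 1
  done
}
```

With DATA_ROOT and MODEL_ROOT exported, a call such as preview_lr_sweep experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh 5e-7 1e-6 prints one resolved command per learning rate.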

Main defaults:

  • train data: DAPO-Math-17K
  • evaluation: AMC23, AIME24, AIME25, HMMT24, HMMT25
  • evaluation metric: Avg@16
  • max response length: 12288
  • validation max response length: 31744
  • rollout number: 4
  • mini-batch size: 64
  • log-prob top-k: 16
  • training steps: 203
  • Prune-OPD metric: overlap ratio, threshold 0.7
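
For reference, Avg@16 is usually read as accuracy averaged over 16 sampled responses per problem; the evaluation code in this repository may differ in detail. Under that assumption, with N problems, reference answers y_i, and k = 16 sampled answers \hat{y}_{i,j} per problem (all symbols introduced here for illustration):

```latex
\mathrm{Avg@}k \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{1}{k} \sum_{j=1}^{k}
  \mathbf{1}\!\left[\hat{y}_{i,j} = y_i\right], \qquad k = 16
```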

Logging is console-only by default. To enable W&B:

WANDB_API_KEY=<your-key> WANDB_MODE=online TRACKING_BACKENDS='[console,wandb]' \
bash experiments_scripts/opd-baseline-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh

Citation

If you find this repository useful, please cite our paper:

@misc{yang2026pruneopd,
  title={Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning},
  author={Zhicheng Yang and Zhijiang Guo and Yifan Song and Minrui Xu and Yongxin Wang and Yiwei Wang and Xiaodan Liang and Jing Tang},
  year={2026},
  eprint={2605.07804},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.07804}
}

Acknowledgement

This codebase builds on the excellent verl training framework and the THUNLP/OPD implementation for on-policy distillation.

About

The official implementation of "Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning"
