Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Reliability-aware pruning for scalable on-policy distillation.

Setup

We recommend a CUDA 12 machine with 8 NVIDIA GPUs. The training code is based on verl.

# 1. Clone and enter the repository.
git clone <repo-url> prune-opd
cd prune-opd

# 2. Create the environment.
conda create -n opd python=3.12 -y
conda activate opd

# 3. Install verl runtime dependencies.
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
cd ..

# 4. Install local packages.
pip install -e ./verl
pip install math-verify

If the FlashAttention wheel selected by `verl/scripts/install_vllm_sglang_mcore.sh` does not match your CUDA/PyTorch platform, install the matching wheel manually, then rerun the remaining pip installs.
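
To pick a matching wheel, it helps to know the Python tag, the PyTorch version, the CUDA version PyTorch was built against, and its C++ ABI setting. The following is a minimal sketch, assuming `python3` resolves to the interpreter of the active `opd` environment; `print_wheel_tags` is a hypothetical helper name and is not part of the repository.

```shell
# Print the values that usually determine which FlashAttention wheel fits.
# Assumption: python3 on PATH is the interpreter of the active environment.
print_wheel_tags() {
  # CPython ABI tag, e.g. cp312 for Python 3.12.
  python3 -c 'import sys; print("python: cp%d%d" % sys.version_info[:2])'
  python3 - <<'EOF'
try:
    import torch
except Exception as exc:  # torch missing or broken: report instead of crashing
    print("torch: unavailable (%s)" % type(exc).__name__)
else:
    print("torch:", torch.__version__)
    print("cuda:", torch.version.cuda)  # None on CPU-only builds
    print("cxx11abi:", torch.compiled_with_cxx11_abi())
EOF
}

print_wheel_tags
```

Compare the printed values with the tags in the wheel filename before installing it.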

Data and Models

Set the data and model roots before running scripts:

export DATA_ROOT=/path/to/datasets
export MODEL_ROOT=/path/to/models

Expected data layout:

${DATA_ROOT}/dapo-math-17k.parquet
${DATA_ROOT}/test_data/AMC23/test.parquet
${DATA_ROOT}/test_data/AIME24/test.parquet
${DATA_ROOT}/test_data/AIME25/test.parquet
${DATA_ROOT}/test_data/HMMT24/test.parquet
${DATA_ROOT}/test_data/HMMT25/test.parquet
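
Before launching a run, the layout above can be verified with a small check. This is a sketch only: `check_data_layout` is a hypothetical helper, not a script shipped with the repository, and it tests for file existence without validating the parquet contents.

```shell
# Report every expected dataset file that is missing under a data root.
# check_data_layout is a hypothetical helper, not part of the repository.
check_data_layout() {
  local root="$1" missing=0 rel
  for rel in \
    dapo-math-17k.parquet \
    test_data/AMC23/test.parquet \
    test_data/AIME24/test.parquet \
    test_data/AIME25/test.parquet \
    test_data/HMMT24/test.parquet \
    test_data/HMMT25/test.parquet
  do
    if [ ! -f "${root}/${rel}" ]; then
      echo "missing: ${root}/${rel}" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}

# Example call (commented out so that sourcing this block has no side effects):
# check_data_layout "$DATA_ROOT" && echo "data layout OK"
```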

Expected model directories:

${MODEL_ROOT}/DeepSeek-R1-Distill-Qwen-1.5B
${MODEL_ROOT}/JustRL-DeepSeek-1.5B
${MODEL_ROOT}/Qwen3-4B-Base
${MODEL_ROOT}/Qwen3-4B

You may also set ACTOR_MODEL_PATH and REWARD_MODEL_PATH directly.
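
One common way to wire such an override is a fallback from the explicit variable to a directory under `MODEL_ROOT`. The sketch below illustrates that pattern and is not taken from the training scripts: `resolve_model_paths` is a hypothetical helper, and the mapping of `ACTOR_MODEL_PATH` to the student and `REWARD_MODEL_PATH` to the teacher is an assumption.

```shell
# Hypothetical sketch: explicit paths win, otherwise fall back to MODEL_ROOT.
# Assumption: ACTOR_MODEL_PATH is the student, REWARD_MODEL_PATH the teacher.
resolve_model_paths() {
  local student="$1" teacher="$2" actor reward dir
  actor="${ACTOR_MODEL_PATH:-${MODEL_ROOT:-}/${student}}"
  reward="${REWARD_MODEL_PATH:-${MODEL_ROOT:-}/${teacher}}"
  for dir in "$actor" "$reward"; do
    if [ ! -d "$dir" ]; then
      echo "missing model directory: $dir" >&2
      return 1
    fi
  done
  # Export only after both directories were found.
  ACTOR_MODEL_PATH="$actor"
  REWARD_MODEL_PATH="$reward"
  export ACTOR_MODEL_PATH REWARD_MODEL_PATH
}

# Example call (commented out so that sourcing this block has no side effects):
# resolve_model_paths Qwen3-4B-Base Qwen3-4B
```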

Training

We provide OPD and Prune-OPD scripts for two teacher-student pairs.

| Script | Description |
| --- | --- |
| `experiments_scripts/opd-baseline-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh` | OPD baseline for DeepSeek-R1-Distill-Qwen-1.5B / JustRL-DeepSeek-1.5B |
| `experiments_scripts/prune-opd-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh` | Prune-OPD for DeepSeek-R1-Distill-Qwen-1.5B / JustRL-DeepSeek-1.5B |
| `experiments_scripts/opd-baseline-qwen3-4b-base-qwen3-4b-non-thinking.sh` | OPD baseline for Qwen3-4B-Base / Qwen3-4B (Non-thinking) |
| `experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh` | Prune-OPD for Qwen3-4B-Base / Qwen3-4B (Non-thinking) |

Preview the resolved command without launching training:

DRY_RUN=1 DATA_ROOT=/path/to/datasets MODEL_ROOT=/path/to/models \
bash experiments_scripts/prune-opd-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh
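
For readers adapting the scripts, a dry-run switch of this kind is often implemented by wrapping the final command in a small function. The sketch below shows that general pattern; `run_or_preview` is a hypothetical name, and the provided scripts may implement `DRY_RUN` differently.

```shell
# Hypothetical launcher pattern: print the command when DRY_RUN=1, else run it.
run_or_preview() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Show the fully resolved command line without executing it.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Example (commented out so that sourcing this block has no side effects):
# DRY_RUN=1
# run_or_preview echo trainer.test_freq=10   # prints: echo trainer.test_freq=10
```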

Run training:

DATA_ROOT=/path/to/datasets MODEL_ROOT=/path/to/models \
bash experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh

Extra Hydra overrides can be appended:

bash experiments_scripts/prune-opd-qwen3-4b-base-qwen3-4b-non-thinking.sh \
  trainer.test_freq=10 actor_rollout_ref.actor.optim.lr=5e-7

Main defaults:

train data: DAPO-Math-17K
evaluation: AMC23, AIME24, AIME25, HMMT24, HMMT25
evaluation metric: Avg@16
max response length: 12288
validation max response length: 31744
rollout number: 4
mini-batch size: 64
log-prob top-k: 16
training steps: 203
Prune-OPD metric: overlap ratio, threshold 0.7

Logging is console-only by default. To enable W&B:

WANDB_API_KEY=<your-key> WANDB_MODE=online TRACKING_BACKENDS='[console,wandb]' \
bash experiments_scripts/opd-baseline-deepseek-r1-distill-qwen-1.5b-justrl-deepseek-1.5b.sh

Citation

If you find this repository useful, please cite our paper:

@misc{yang2026pruneopd,
  title={Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning},
  author={Zhicheng Yang and Zhijiang Guo and Yifan Song and Minrui Xu and Yongxin Wang and Yiwei Wang and Xiaodan Liang and Jing Tang},
  year={2026},
  eprint={2605.07804},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.07804}
}

Acknowledgement

This codebase builds on the excellent verl training framework and the THUNLP/OPD implementation for on-policy distillation.
