Official repository for the paper:
DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
- arXiv: https://arxiv.org/abs/2510.09255
- Authors: Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao
- Version: arXiv v4, revised on March 19, 2026
Dynamic-filter Sequence-level Policy Optimization (DSPO) is a reinforcement learning (RL) algorithm designed for stable and efficient agentic search and reasoning.
DSPO trains models to interleave multi-turn search and reasoning through reinforcement learning, using sequence-level optimization and dynamic sample filtering to improve training stability and performance.
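Until the official code is released, the two ingredients named above can be illustrated with a rough, unofficial sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the filter drops prompt groups whose rollouts all received the same reward (so they carry no relative learning signal), the sequence-level ratio is a length-normalized likelihood ratio, and the function names, the `clip` value, and the data layout are hypothetical. The paper's actual criteria may differ.

```python
import math
from statistics import pstdev


def filter_groups(groups, eps=1e-8):
    """Assumed dynamic filter: keep only groups whose rewards are not all identical."""
    return [g for g in groups if pstdev(g["rewards"]) > eps]


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Assumed sequence-level ratio: exp of the mean per-token log-prob difference."""
    diff = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(diff / len(new_logps))


def clipped_objective(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate applied once per sequence (clip value is illustrative)."""
    clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical rollout groups: one list of rewards per prompt.
groups = [
    {"rewards": [1.0, 1.0, 1.0]},  # uniform rewards, removed by the filter
    {"rewards": [0.0, 1.0]},       # mixed rewards, kept
]
kept = filter_groups(groups)
advantages = group_advantages(kept[0]["rewards"])
```

In this sketch, one ratio and one clipping decision are computed per whole response rather than per token, which is what "sequence-level" is taken to mean here.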
- [2026-03-19] Paper updated to arXiv v4.
- [2025-10-10] Paper first released on arXiv.
Code and training details will be released soon.
If you find this work useful, please cite:
@article{gu2025dspo,
  title={DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning},
  author={Gu, Chenyang and Pu, Yewen and Yang, Bruce and Li, Xiaofan and Gao, Huan},
  journal={arXiv preprint arXiv:2510.09255},
  year={2025}
}

The paper and figures are licensed under CC BY-SA 4.0.
Code, if released, will be licensed separately.