Your Question
I want to contribute TCOD (multi-turn OPD + temporal curriculum) on top of the existing single-turn OPD — reuse, not fork. Confirming design/placement before coding.
Problem. Single-turn OPD on long-horizon agents → Trajectory-Level KL Instability: per-turn KL grows with turn index, trajectory KL escalates, and success rate (SR) collapses (Qwen3-1.7B → ~0 SR).
Method. Same KL-to-teacher objective; only the trajectory depth grows: k = min(k_start + floor(rollout_id / eta), k_max).
- F2B: student rolls out the first k turns only.
- B2F: teacher replays the first L-k turns (loss_mask=0), student takes the remaining k.
Paper: +~15 SR over vanilla OPD, stable KL, ~32% less train time.
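For concreteness, the schedule and the two variants could be sketched roughly as follows. This is only an illustration: `curriculum_depth` and `student_turn_span` are hypothetical helper names (not existing vime functions), and `total_turns` stands for L above.

```python
def curriculum_depth(rollout_id: int, k_start: int, eta: int, k_max: int) -> int:
    """k = min(k_start + floor(rollout_id / eta), k_max)."""
    if eta <= 0:
        raise ValueError("eta must be a positive integer")
    # Integer floor division implements floor(rollout_id / eta).
    return min(k_start + rollout_id // eta, k_max)


def student_turn_span(variant: str, k: int, total_turns: int) -> tuple[int, int]:
    """Return the half-open [start, end) range of turns the student generates."""
    k = max(0, min(k, total_turns))  # never exceed the trajectory length
    if variant == "f2b":
        # F2B: student rolls out the first k turns only.
        return (0, k)
    if variant == "b2f":
        # B2F: teacher replays the first L-k turns, student takes the last k.
        return (total_turns - k, total_turns)
    raise ValueError(f"unknown variant: {variant!r}")
```

With `k_start=1` and `eta=2`, the depth increases by one turn every two rollouts until it reaches `k_max`.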
Maps onto vime as one new generate function:
| Need | Reuse | New |
| --- | --- | --- |
| teacher logprobs + KL | --custom-rm-path …on_policy_distillation.reward_func + --custom-reward-post-process-path …post_process_rewards, --use-opd --opd-type vllm | none |
| multi-turn loop | --custom-generate-function-path pattern | vime/rollout/tcod.py:generate |
| curriculum k | rollout_id (already passed in) | compute k inside generate |
| B2F prefix mask | Sample.loss_mask | set =0 on teacher-prefix turns |
No change to loss/advantage code, no new teacher backend, no new abstraction (backend-agnostic, no GPU/NPU branch).
# reused
--use-opd --opd-type vllm --opd-kl-coef 1.0
--custom-rm-path vime.rollout.on_policy_distillation.reward_func
--custom-reward-post-process-path vime.rollout.on_policy_distillation.post_process_rewards
# new
--custom-generate-function-path vime.rollout.tcod.generate
--opd-curriculum --opd-curriculum-variant f2b # f2b | b2f
--opd-k-start 1 --opd-eta 2 --opd-k-max <max_turns>
--opd-b2f-trajectory-path <traj.jsonl> # b2f only
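To make the "B2F prefix mask" row concrete, here is a minimal sketch of how the mask might be built. It assumes the loss mask is a flat per-token list of 0/1 values and ignores environment-observation tokens for simplicity; `b2f_loss_mask` is a hypothetical helper, not an existing vime API.

```python
def b2f_loss_mask(turn_token_counts: list[int], k: int) -> list[int]:
    """Build a per-token mask: 0 on teacher-prefix turns, 1 on the student's last k turns.

    turn_token_counts[i] is the number of generated tokens in turn i (assumed layout).
    """
    total_turns = len(turn_token_counts)
    k = max(0, min(k, total_turns))
    prefix_turns = total_turns - k  # turns replayed from the teacher trajectory
    mask: list[int] = []
    for turn_idx, n_tokens in enumerate(turn_token_counts):
        value = 0 if turn_idx < prefix_turns else 1
        mask.extend([value] * n_tokens)
    return mask
```

The result would be assigned to the sample's loss mask inside the new generate function, so the existing loss/advantage code only sees student-generated tokens.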
What I've Tried
- Single-turn OPD exists: vime/rollout/on_policy_distillation.py (reward_func / post_process_rewards), --use-opd --opd-type vllm --opd-kl-coef, CI tests/test_qwen2.5_0.5B_opd_vllm.py. It's single-turn (gsm8k).
- Multi-turn rollout exists: examples/multi_agent/rollout_with_multi_agents.py, examples/geo3k_vlm_multi_turn/rollout.py, registered via --custom-generate-function-path (generate(args, rollout_id, data_source, evaluation=False) -> RolloutFnTrainOutput). Async: vime/rollout/fully_async_rollout.py.
- Gap: no multi-turn OPD. Naive multi-turn OPD is unstable. I authored TCOD (arXiv:2604.24005), which fixes this and can reuse the OPD path above.
Environment (if relevant)
- vime version:
- Python version:
- PyTorch version:
- CUDA version:
- GPU type and count:
- OS:
Additional Context
Plan. (1) multi-turn OPD baseline + KL logging; (2) F2B; (3) B2F + prefix mask; (4) CI test (mirror test_qwen2.5_0.5B_opd_vllm.py) + examples/tcod/ + EN/ZH docs.
Questions.
- Canonical flag: examples use --custom-generate-function-path, the add-rollout-function skill says --rollout-function-path. Which? And vime/rollout/tcod.py vs examples/tcod/?
- --opd-curriculum* naming OK?
- Do multi-turn envs support reset-to-prefix-state (needed for B2F)? If not, gate B2F; F2B is the fallback.
- Can you assign this to me? I'll start with (1) and open a DCO-signed [Rollout] draft PR.
Refs. TCOD arXiv:2604.24005; vime/rollout/on_policy_distillation.py; vime is derived from slime.
Pre-submission Checklist