[RFC] Multi-turn OPD for agent with a temporal curriculum #231

Description

@kokolerk

Your Question

I'd like to contribute TCOD (multi-turn OPD plus a temporal curriculum) on top of the existing single-turn OPD, reusing that path rather than forking it. I want to confirm the design and file placement before coding.

Problem. Applying single-turn OPD to long-horizon agents causes Trajectory-Level KL Instability: per-turn KL grows with the turn index, trajectory-level KL escalates, and success rate (SR) collapses (Qwen3-1.7B drops to ~0 SR).

Method. Keep the same KL-to-teacher objective and change only the trajectory depth k, which grows with the rollout index: k = min(k_start + floor(rollout_id / eta), k_max).

  • F2B: the student rolls out only the first k turns.
  • B2F: the teacher replays the first L-k turns (loss_mask=0) and the student takes the remaining k, where L is the trajectory length in turns.
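
As a minimal sketch of the schedule and the two variants above: the function names `curriculum_depth` and `split_turns` are hypothetical and not part of vime, and clamping k to the trajectory length is an assumption about how short trajectories would be handled.

```python
def curriculum_depth(rollout_id: int, k_start: int, eta: int, k_max: int) -> int:
    """k = min(k_start + floor(rollout_id / eta), k_max)."""
    if eta <= 0:
        raise ValueError("eta must be a positive integer")
    # Integer floor division matches floor() for non-negative rollout_id.
    return min(k_start + rollout_id // eta, k_max)


def split_turns(total_turns: int, k: int, variant: str) -> tuple[int, int]:
    """Return (teacher_prefix_turns, student_turns) for one trajectory.

    Assumption: k is clamped to the trajectory length so that a short
    trajectory never yields a negative teacher prefix.
    """
    k = max(0, min(k, total_turns))
    if variant == "f2b":
        # Student rolls out the first k turns; no teacher prefix.
        return 0, k
    if variant == "b2f":
        # Teacher replays the first L-k turns; student takes the last k.
        return total_turns - k, k
    raise ValueError(f"unknown variant: {variant!r}")


if __name__ == "__main__":
    # With k_start=1, eta=2, k_max=5 the depth grows by one every two rollouts.
    for rollout_id in (0, 1, 2, 3, 8, 100):
        print(rollout_id, curriculum_depth(rollout_id, k_start=1, eta=2, k_max=5))
```

With `--opd-k-start 1 --opd-eta 2`, this schedule would give k=1 for rollouts 0 and 1, k=2 for rollouts 2 and 3, and so on until it saturates at `--opd-k-max`.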

Results reported in the paper: +~15 SR over vanilla OPD, stable KL, and ~32% less training time.

Maps onto vime as one new generate function:

| Need | Reuse | New |
| --- | --- | --- |
| teacher logprobs + KL | --custom-rm-path …on_policy_distillation.reward_func + --custom-reward-post-process-path …post_process_rewards, --use-opd --opd-type vllm | (none) |
| multi-turn loop | --custom-generate-function-path pattern | vime/rollout/tcod.py:generate |
| curriculum k | rollout_id (already passed in) | compute k inside generate |
| B2F prefix mask | Sample.loss_mask | set =0 on teacher-prefix turns |
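
To illustrate the B2F prefix-mask row, here is a rough sketch that builds a token-level mask from per-turn token counts. The helper `build_loss_mask` is hypothetical, and the assumption that `Sample.loss_mask` is a flat list of 0/1 values over response tokens should be checked against the actual `Sample` definition. A real rollout would likely also mask environment-observation tokens, which this sketch ignores.

```python
def build_loss_mask(turn_token_counts: list[int], num_prefix_turns: int) -> list[int]:
    """Mask out (0) every token in the teacher-replayed prefix turns.

    turn_token_counts[i] is the number of response tokens in turn i.
    Tokens in turns [0, num_prefix_turns) get 0; later turns get 1.
    """
    mask: list[int] = []
    for turn_idx, n_tokens in enumerate(turn_token_counts):
        value = 0 if turn_idx < num_prefix_turns else 1
        mask.extend([value] * n_tokens)
    return mask


if __name__ == "__main__":
    # Three turns of 2, 3, and 1 tokens; the teacher replays the first two.
    print(build_loss_mask([2, 3, 1], num_prefix_turns=2))
```

Because only the mask changes, the loss and advantage code would see teacher-prefix tokens as ordinary masked positions.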

No change to loss/advantage code, no new teacher backend, no new abstraction (backend-agnostic, no GPU/NPU branch).

# reused
--use-opd --opd-type vllm --opd-kl-coef 1.0
--custom-rm-path vime.rollout.on_policy_distillation.reward_func
--custom-reward-post-process-path vime.rollout.on_policy_distillation.post_process_rewards
# new
--custom-generate-function-path vime.rollout.tcod.generate
--opd-curriculum --opd-curriculum-variant f2b   # f2b | b2f
--opd-k-start 1 --opd-eta 2 --opd-k-max <max_turns>
--opd-b2f-trajectory-path <traj.jsonl>          # b2f only

What I've Tried

  • Single-turn OPD exists: vime/rollout/on_policy_distillation.py (reward_func / post_process_rewards), --use-opd --opd-type vllm --opd-kl-coef, CI tests/test_qwen2.5_0.5B_opd_vllm.py. It's single-turn (gsm8k).
  • Multi-turn rollout exists: examples/multi_agent/rollout_with_multi_agents.py, examples/geo3k_vlm_multi_turn/rollout.py, registered via --custom-generate-function-path (generate(args, rollout_id, data_source, evaluation=False) -> RolloutFnTrainOutput). Async: vime/rollout/fully_async_rollout.py.
  • Gap: no multi-turn OPD. Naive multi-turn OPD is unstable. I authored TCOD (arXiv:2604.24005), which fixes this and can reuse the OPD path above.

Additional Context

Plan. (1) multi-turn OPD baseline + KL logging; (2) F2B; (3) B2F + prefix mask; (4) CI test (mirror test_qwen2.5_0.5B_opd_vllm.py) + examples/tcod/ + EN/ZH docs.
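
For the KL logging in step (1), one possible sketch groups token-level log-probability differences by turn. It uses the common Monte Carlo estimate of reverse KL on student-sampled tokens (student logprob minus teacher logprob); whether the existing `reward_func` uses the same estimator is an assumption, and `per_turn_kl` and its `turn_ids` input are hypothetical names.

```python
from collections import defaultdict


def per_turn_kl(
    student_logprobs: list[float],
    teacher_logprobs: list[float],
    turn_ids: list[int],
) -> dict[int, float]:
    """Mean of (student_logprob - teacher_logprob) over the tokens of each turn."""
    if not (len(student_logprobs) == len(teacher_logprobs) == len(turn_ids)):
        raise ValueError("all inputs must have one entry per token")
    sums: dict[int, float] = defaultdict(float)
    counts: dict[int, int] = defaultdict(int)
    for s_lp, t_lp, turn in zip(student_logprobs, teacher_logprobs, turn_ids):
        sums[turn] += s_lp - t_lp
        counts[turn] += 1
    # Sort by turn index so the log shows how KL evolves along the trajectory.
    return {turn: sums[turn] / counts[turn] for turn in sorted(sums)}


if __name__ == "__main__":
    kl = per_turn_kl(
        student_logprobs=[-1.0, -1.0, -2.0],
        teacher_logprobs=[-1.5, -1.5, -4.0],
        turn_ids=[0, 0, 1],
    )
    print(kl)
```

Logging this dictionary per rollout would make the growth of per-turn KL with turn index directly visible in the baseline.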

Questions.

  1. Which flag is canonical? The examples use --custom-generate-function-path, but the add-rollout-function skill says --rollout-function-path. And should the code live in vime/rollout/tcod.py or in examples/tcod/?
  2. --opd-curriculum* naming OK?
  3. Do multi-turn envs support reset-to-prefix-state (needed for B2F)? If not, gate B2F; F2B is the fallback.
  4. Can you assign this to me? I'll start with (1) and open a DCO-signed [Rollout] draft PR.

Refs. TCOD arXiv:2604.24005; vime/rollout/on_policy_distillation.py; vime is derived from slime.
