[RFC] Multi-turn OPD for agents with a temporal curriculum

### Your Question

I want to contribute **TCOD** (multi-turn on-policy distillation (OPD) plus a temporal curriculum) on top of the existing single-turn OPD path, reusing it rather than forking it. I'd like to confirm the design and placement before coding.

**Problem.** Applying single-turn OPD to long-horizon agents leads to trajectory-level KL instability: per-turn KL grows with turn index, trajectory-level KL escalates, and success rate (SR) collapses (to ~0 SR for Qwen3-1.7B).
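To make "per-turn KL grows with turn index" measurable, here is a minimal sketch of the kind of per-turn logging I have in mind. It assumes the rollout gives per-token student and teacher logprobs on the student-sampled tokens plus a turn index per token; the function name and input layout are hypothetical, not an existing vime API. The estimator is the plain sampled log-ratio `log p_student - log p_teacher`, averaged per turn.

```python
from collections import defaultdict
from typing import Dict, List


def per_turn_kl(
    student_logprobs: List[float],
    teacher_logprobs: List[float],
    turn_ids: List[int],
) -> Dict[int, float]:
    """Mean sampled log-ratio (student minus teacher) per turn.

    All three lists are token-aligned; turn_ids[i] is the turn that
    token i belongs to. This layout is an assumption for illustration.
    """
    if not (len(student_logprobs) == len(teacher_logprobs) == len(turn_ids)):
        raise ValueError("inputs must be token-aligned and equal length")

    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for s, t, turn in zip(student_logprobs, teacher_logprobs, turn_ids):
        sums[turn] += s - t
        counts[turn] += 1

    # Sorted by turn index so growth over turns is easy to read in logs.
    return {turn: sums[turn] / counts[turn] for turn in sorted(sums)}
```

Logging this dict per rollout should show whether later turns carry a larger estimate than earlier ones.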

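**Method.** Same KL-to-teacher objective; the only change is a curriculum on trajectory depth: `k = min(k_start + floor(rollout_id / eta), k_max)`, so `k` grows by one every `eta` rollouts until it reaches `k_max`.
- **F2B**: student rolls out the first `k` turns only.
- **B2F**: teacher replays the first `L-k` turns of an `L`-turn trajectory (`loss_mask=0`); the student takes the remaining `k`.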
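The schedule and the two variants can be sketched as below. This is an illustration under stated assumptions, not the final implementation: `curriculum_depth` and `turn_roles` are hypothetical helper names, and `total_turns` stands for the `L` above.

```python
from typing import List


def curriculum_depth(rollout_id: int, k_start: int, eta: int, k_max: int) -> int:
    """k = min(k_start + floor(rollout_id / eta), k_max)."""
    if eta <= 0:
        raise ValueError("eta must be a positive integer")
    return min(k_start + rollout_id // eta, k_max)


def turn_roles(variant: str, k: int, total_turns: int) -> List[str]:
    """Who acts at each turn of one trajectory for the given variant."""
    k = max(0, min(k, total_turns))  # never exceed the trajectory length
    if variant == "f2b":
        # Student rolls out the first k turns; the episode stops there.
        return ["student"] * k
    if variant == "b2f":
        # Teacher trajectory is replayed for the first L-k turns,
        # then the student takes the remaining k turns.
        return ["teacher"] * (total_turns - k) + ["student"] * k
    raise ValueError(f"unknown variant: {variant!r}")
```

With `k_start=1`, `eta=2`, `k_max=5`, rollouts 0 and 1 use `k=1`, rollouts 2 and 3 use `k=2`, and `k` stays at 5 from rollout 8 onward.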

The paper reports roughly +15 SR over vanilla OPD, stable KL, and ~32% less training time.

**Maps onto vime as one new generate function:**

| Need | Reuse | New |
| --- | --- | --- |
| teacher logprobs + KL | `--custom-rm-path …on_policy_distillation.reward_func` + `--custom-reward-post-process-path …post_process_rewards`, `--use-opd --opd-type vllm` | — |
| multi-turn loop | `--custom-generate-function-path` pattern | `vime/rollout/tcod.py:generate` |
| curriculum `k` | `rollout_id` (already passed in) | compute `k` inside `generate` |
| B2F prefix mask | `Sample.loss_mask` | set `=0` on teacher-prefix turns |

No change to loss/advantage code, no new teacher backend, no new abstraction (backend-agnostic, no GPU/NPU branch).
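For the B2F prefix mask row, a minimal sketch of the masking logic, assuming a flat token-level mask and a known token count per turn. `b2f_loss_mask` and its inputs are hypothetical; the real layout of `Sample.loss_mask` should be taken from the existing multi-turn examples. The sketch also ignores environment/observation tokens inside student turns, which a real rollout would presumably mask as well.

```python
from typing import List


def b2f_loss_mask(turn_token_counts: List[int], k: int) -> List[int]:
    """Token-level mask: 0 on teacher-replayed prefix turns, 1 on the last k turns.

    turn_token_counts[i] is the number of response tokens in turn i
    (an assumed layout, for illustration only).
    """
    total_turns = len(turn_token_counts)
    k = max(0, min(k, total_turns))
    prefix_turns = total_turns - k  # the L-k turns replayed from the teacher

    mask: List[int] = []
    for turn_idx, n_tokens in enumerate(turn_token_counts):
        value = 0 if turn_idx < prefix_turns else 1
        mask.extend([value] * n_tokens)
    return mask
```

Because only the mask changes, the existing loss and advantage code would see the teacher prefix as context and train on student turns only.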

```bash
# reused
--use-opd --opd-type vllm --opd-kl-coef 1.0
--custom-rm-path vime.rollout.on_policy_distillation.reward_func
--custom-reward-post-process-path vime.rollout.on_policy_distillation.post_process_rewards
# new
--custom-generate-function-path vime.rollout.tcod.generate
--opd-curriculum --opd-curriculum-variant f2b   # f2b | b2f
--opd-k-start 1 --opd-eta 2 --opd-k-max <max_turns>
--opd-b2f-trajectory-path <traj.jsonl>          # b2f only
```
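A rough sketch of how the new flags could be declared and validated. It uses plain `argparse` because I haven't confirmed how vime registers arguments; the helper names and the defaults are assumptions, and only the flag names come from the proposal above.

```python
import argparse


def add_tcod_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register the proposed TCOD flags (defaults here are placeholders)."""
    group = parser.add_argument_group("TCOD curriculum (proposed)")
    group.add_argument("--opd-curriculum", action="store_true",
                       help="enable the temporal curriculum")
    group.add_argument("--opd-curriculum-variant", choices=["f2b", "b2f"],
                       default="f2b")
    group.add_argument("--opd-k-start", type=int, default=1)
    group.add_argument("--opd-eta", type=int, default=2)
    group.add_argument("--opd-k-max", type=int, default=None)
    group.add_argument("--opd-b2f-trajectory-path", type=str, default=None)
    return parser


def validate_tcod_arguments(args: argparse.Namespace) -> None:
    """Fail early on inconsistent curriculum settings."""
    if not args.opd_curriculum:
        return
    if args.opd_eta <= 0:
        raise ValueError("--opd-eta must be a positive integer")
    if args.opd_curriculum_variant == "b2f" and not args.opd_b2f_trajectory_path:
        raise ValueError("--opd-b2f-trajectory-path is required for b2f")
```

Validating at parse time keeps the B2F-only requirement (a teacher trajectory file) out of the rollout loop.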



### What I've Tried

- Single-turn OPD exists: `vime/rollout/on_policy_distillation.py` (`reward_func` / `post_process_rewards`), `--use-opd --opd-type vllm --opd-kl-coef`, CI `tests/test_qwen2.5_0.5B_opd_vllm.py`. It's single-turn (gsm8k).
- Multi-turn rollout exists: `examples/multi_agent/rollout_with_multi_agents.py`, `examples/geo3k_vlm_multi_turn/rollout.py`, registered via `--custom-generate-function-path` (`generate(args, rollout_id, data_source, evaluation=False) -> RolloutFnTrainOutput`). Async: `vime/rollout/fully_async_rollout.py`.
- Gap: no multi-turn OPD. Naive multi-turn OPD is unstable. I authored TCOD (arXiv:2604.24005), which fixes this and can reuse the OPD path above.

### Environment (if relevant)

- vime version:
- Python version:
- PyTorch version:
- CUDA version:
- GPU type and count:
- OS:


### Additional Context

**Plan.** (1) multi-turn OPD baseline + KL logging; (2) F2B; (3) B2F + prefix mask; (4) CI test (mirror `test_qwen2.5_0.5B_opd_vllm.py`) + `examples/tcod/` + EN/ZH docs.

**Questions.**
1. Canonical flag: examples use `--custom-generate-function-path`, the `add-rollout-function` skill says `--rollout-function-path`. Which? And `vime/rollout/tcod.py` vs `examples/tcod/`?
2. `--opd-curriculum*` naming OK?
3. Do multi-turn envs support reset-to-prefix-state (needed for B2F)? If not, gate B2F; F2B is the fallback.
4. Can you assign this to me? I'll start with (1) and open a DCO-signed `[Rollout]` draft PR.

**Refs.** TCOD arXiv:2604.24005; `vime/rollout/on_policy_distillation.py`; vime is derived from slime.

### Pre-submission Checklist

- [x] I have read the [CONTRIBUTING.md](https://github.com/vllm-project/vime/blob/main/CONTRIBUTING.md) and understand the collaboration scope.
- [x] I have read the [documentation](https://vllm-project.github.io/vime/) and [FAQ](https://vllm-project.github.io/vime/en/get_started/qa.html) and my question is not answered there.
- [x] I have searched for [existing issues](https://github.com/vllm-project/vime/issues) and my question has not been asked before.

