Skip to content

Latest commit

 

History

History
138 lines (100 loc) · 5.72 KB

File metadata and controls

138 lines (100 loc) · 5.72 KB

Language-based Trial and Error Fails in the Era of Experience

Sub-Scale Collaboration On Unseen Task (SCOUT)
Decoupling Exploration from Exploitation for Efficient LLM Agent Training

HOMEPAGE Paper Code Model Dataset

📖 Overview

SCOUT is a novel framework that addresses the inefficiency of Large Language Models (LLMs) in exploring unseen, non-linguistic environments (e.g., symbolic or spatial tasks).

While LLMs excel at exploitation (reasoning based on knowledge), they are computationally expensive and inefficient at exploration (trial-and-error). SCOUT decouples these two processes:

  1. Lightweight Scouts: Use small networks (MLPs/CNNs) to rapidly master environmental dynamics via standard RL.
  2. Sub-Scale Collaboration: Distill the scout's expert trajectories into the LLM via SFT.
  3. Evolution: Activate the LLM's latent world knowledge through multi-turn RL (PPO).

Empirically, SCOUT enables a Qwen2.5-3B model to achieve an average score of 0.86 on complex tasks (including Rubik's Cube and 2048), significantly performing proprietary models like Gemini-2.5-Pro (0.60), while reducing GPU hours by ~60%.

This repository is built upon the RAGEN framework.

🎁 Updates

🚀 The SCOUT Framework

SCOUT Framework Overview

The training pipeline consists of three distinct stages:

  1. Exploration Stage (Scout Training):

    • Agents: Small MLPs or CNNs ($~10^{-5}$B parameters).
    • Algorithm: DQN or PPO.
    • Goal: Efficiently map transition dynamics and generate expert trajectories ($\tau_{scout}$).
  2. Distillation Stage (SFT):

    • Process: Transform $\tau_{scout}$ into text-based dialogue formats using a deterministic Textualizer.
    • Goal: "Warm up" the LLM to understand the physics of the unseen task.
  3. Evolving Stage (Multi-turn RL):

    • Algorithm: Multi-turn PPO (via RAGEN).
    • Goal: Refine reasoning and enable the LLM to self-evolve beyond the scout's capabilities.

🛠️ Installation

# Clone the repository
git clone https://github.com/Harry-mic/SCOUT.git
cd SCOUT

# Setup the environment (based on RAGEN)
bash scripts/setup_ragen.sh

🎮 Environments

We introduce several OOD (Out-of-Distribution) symbolic and spatial tasks:

Rubik's Cube: Restore a 2x2 scrambled cube (spatial reasoning).

2048: Long-horizon planning (>800 turns).

Sudoku: Logic-based constraint satisfaction.

Sokoban: Box-pushing planning task.

FrozenLake: Stochastic navigation (Static & Slippery variants).

Bandit: Fundamental RL benchmark.

⚡ Usage

1. Exploration Stage (Train Scouts)

Train lightweight scouts (MLP/CNN) to collect expert trajectories.

# Example: Train a DQN scout for Frozenlake and collected the trajectories as runs_scouts
python scout_dqn/dqn_frozenlake.py --track

2. Distillation Stage (SFT)

Textualizer the collected datasets.

# Textualizer from one-hot vectors to language dialogues.
python scripts/Textualizer_frozenlake.py  runs_scouts/Frozenlake_dqn_*** --step step_***

Fine-tune the base LLM on the collected trajectories. We utilize LLaMA-Factory for this stage.

# Run SFT on previous collected dialogues.
llama-factory train xxx.yaml

3. Evolving Stage (Multi-turn RL)

Run multi-turn PPO on the SFT model using the RAGEN infrastructure.

Start Training:

bash scripts/example_bash.sh

📊 Performance

SCOUT achieves state-of-the-art performance on unseen tasks while saving 60% of computational costs compared to direct RL training.

SCOUT Framework Overview

📂 Repository Structure

SCOUT/
├── ragen/                  # Core RAGEN framework (Env Manager, Context Manager)
├── scout_dqn/              # Lightweight scout training (DQN) & Textualizers
├── config/                 # Hydra configurations for PPO/GRPO
├── scripts/                # Setup and utility scripts
└── train.py                # Main entry point for Evolving Stage

📜 Citation

If you find SCOUT useful for your research, please cite our paper:

@article{wang2026language,
  title={Language-based Trial and Error Falls Behind in the Era of Experience},
  author={Wang, Haoyu and Ma, Guozheng and Cui, Shugang and Kong, Yilun and Luo, Haotian and Shen, Li and Gao, Mengya and Wu, Yichao and Wang, Xiaogang and Tao, Dacheng},
  journal={arXiv preprint arXiv:2601.21754},
  year={2026}
}

Acknowledgements

This codebase is built upon RAGEN. We thank the RAGEN team for their infrastructure support.