Language-based Trial and Error Fails in the Era of Experience

Sub-Scale Collaboration On Unseen Task (SCOUT)
Decoupling Exploration from Exploitation for Efficient LLM Agent Training

📖 Overview

SCOUT is a novel framework that addresses the inefficiency of Large Language Models (LLMs) in exploring unseen, non-linguistic environments (e.g., symbolic or spatial tasks).

While LLMs excel at exploitation (reasoning based on knowledge), they are computationally expensive and inefficient at exploration (trial-and-error). SCOUT decouples these two processes:

Lightweight Scouts: Use small networks (MLPs/CNNs) to rapidly master environmental dynamics via standard RL.
Sub-Scale Collaboration: Distill the scout's expert trajectories into the LLM via SFT.
Evolution: Activate the LLM's latent world knowledge through multi-turn RL (PPO).

Empirically, SCOUT enables a Qwen2.5-3B model to achieve an average score of 0.86 on complex tasks (including Rubik's Cube and 2048), significantly performing proprietary models like Gemini-2.5-Pro (0.60), while reducing GPU hours by ~60%.

This repository is built upon the RAGEN framework.

🎁 Updates

2026.01.30: We release our paper: Language-based Trial and Error Falls Behind in the Era of Experience at arxiv and code.
2026.01.31: We release the multi-task models at huggingface.

🚀 The SCOUT Framework

The training pipeline consists of three distinct stages:

Exploration Stage (Scout Training):
- Agents: Small MLPs or CNNs ($~10^{-5}$B parameters).
- Algorithm: DQN or PPO.
- Goal: Efficiently map transition dynamics and generate expert trajectories ($\tau_{scout}$).
Distillation Stage (SFT):
- Process: Transform $\tau_{scout}$ into text-based dialogue formats using a deterministic Textualizer.
- Goal: "Warm up" the LLM to understand the physics of the unseen task.
Evolving Stage (Multi-turn RL):
- Algorithm: Multi-turn PPO (via RAGEN).
- Goal: Refine reasoning and enable the LLM to self-evolve beyond the scout's capabilities.

🛠️ Installation

# Clone the repository
git clone https://github.com/Harry-mic/SCOUT.git
cd SCOUT

# Setup the environment (based on RAGEN)
bash scripts/setup_ragen.sh

🎮 Environments

We introduce several OOD (Out-of-Distribution) symbolic and spatial tasks:

Rubik's Cube: Restore a 2x2 scrambled cube (spatial reasoning).

2048: Long-horizon planning (>800 turns).

Sudoku: Logic-based constraint satisfaction.

Sokoban: Box-pushing planning task.

FrozenLake: Stochastic navigation (Static & Slippery variants).

Bandit: Fundamental RL benchmark.

⚡ Usage

1. Exploration Stage (Train Scouts)

Train lightweight scouts (MLP/CNN) to collect expert trajectories.

# Example: Train a DQN scout for Frozenlake and collected the trajectories as runs_scouts
python scout_dqn/dqn_frozenlake.py --track

2. Distillation Stage (SFT)

Textualizer the collected datasets.

# Textualizer from one-hot vectors to language dialogues.
python scripts/Textualizer_frozenlake.py  runs_scouts/Frozenlake_dqn_*** --step step_***

Fine-tune the base LLM on the collected trajectories. We utilize LLaMA-Factory for this stage.

# Run SFT on previous collected dialogues.
llama-factory train xxx.yaml

3. Evolving Stage (Multi-turn RL)

Run multi-turn PPO on the SFT model using the RAGEN infrastructure.

Start Training:

bash scripts/example_bash.sh

📊 Performance

SCOUT achieves state-of-the-art performance on unseen tasks while saving 60% of computational costs compared to direct RL training.

📂 Repository Structure

SCOUT/
├── ragen/                  # Core RAGEN framework (Env Manager, Context Manager)
├── scout_dqn/              # Lightweight scout training (DQN) & Textualizers
├── config/                 # Hydra configurations for PPO/GRPO
├── scripts/                # Setup and utility scripts
└── train.py                # Main entry point for Evolving Stage

📜 Citation

If you find SCOUT useful for your research, please cite our paper:

@article{wang2026language,
  title={Language-based Trial and Error Falls Behind in the Era of Experience},
  author={Wang, Haoyu and Ma, Guozheng and Cui, Shugang and Kong, Yilun and Luo, Haotian and Shen, Li and Gao, Mengya and Wu, Yichao and Wang, Xiaogang and Tao, Dacheng},
  journal={arXiv preprint arXiv:2601.21754},
  year={2026}
}

Acknowledgements

This codebase is built upon RAGEN. We thank the RAGEN team for their infrastructure support.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Language-based Trial and Error Fails in the Era of Experience

📖 Overview

🎁 Updates

🚀 The SCOUT Framework

🛠️ Installation

🎮 Environments

⚡ Usage

1. Exploration Stage (Train Scouts)

2. Distillation Stage (SFT)

3. Evolving Stage (Multi-turn RL)

📊 Performance

📂 Repository Structure

📜 Citation

Acknowledgements

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Language-based Trial and Error Fails in the Era of Experience

📖 Overview

🎁 Updates

🚀 The SCOUT Framework

🛠️ Installation

🎮 Environments

⚡ Usage

1. Exploration Stage (Train Scouts)

2. Distillation Stage (SFT)

3. Evolving Stage (Multi-turn RL)

📊 Performance

📂 Repository Structure

📜 Citation

Acknowledgements