RL2F: Replicating "Reinforcement Learning from Language Feedback"

Paper: Improving Interactive In-Context Learning from Natural Language Feedback (Klissarov et al., 2026)

Replication by: Rajat Dandekar · Vizuara

Infrastructure: 2× NVIDIA A100 80GB (RunPod) · Gemma 3 12B · ~65 GPU-hours total

Overview

This repository contains a full end-to-end replication of the RL2F paper, which proposes training LLMs to learn from corrective natural language feedback through information asymmetry between a teacher and student model.

The key idea: a teacher model sees both the problem and the ground-truth solution, while a student model only sees the problem. The teacher provides hints (never the answer) to guide the student. The student is then trained via GRPO (Group Relative Policy Optimization) to improve its ability to incorporate feedback across multiple turns.
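The interaction described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the repository's implementation: `student`, `teacher`, and `is_correct` are hypothetical callables standing in for the two models and the answer checker, and the default of 5 turns follows the "Max turns" setting in the configuration.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (student attempt, teacher hint) pairs


def run_dialogue(
    problem: str,
    solution: str,
    student: Callable[[str, History], str],
    teacher: Callable[[str, str, str], str],
    is_correct: Callable[[str, str], bool],
    max_turns: int = 5,
) -> List[Tuple[str, str, bool]]:
    """Run one teacher-student dialogue; return (attempt, hint, correct) per turn."""
    history: History = []
    transcript: List[Tuple[str, str, bool]] = []
    for _ in range(max_turns):
        # The student sees only the problem and earlier feedback, never the solution.
        attempt = student(problem, history)
        if is_correct(attempt, solution):
            transcript.append((attempt, "", True))
            break
        # The teacher sees the ground-truth solution and is expected to hint, not reveal.
        hint = teacher(problem, solution, attempt)
        history.append((attempt, hint))
        transcript.append((attempt, hint, False))
    return transcript
```

The information asymmetry lives entirely in the call signatures: `solution` is passed to `teacher` and `is_correct`, but never to `student`.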

What We Built

Complete Training Pipeline (3 methods)

Method	Description	Training Time	Checkpoint
SFT Baseline	Supervised fine-tuning on teacher-student dialogues	~6 hours	1.04 GB (LoRA r=64)
RLVR Baseline	Single-turn RL (no teacher feedback)	~12 hours	274 MB (LoRA r=16)
RL2F	Multi-turn RL with teacher feedback (paper's method)	~46 hours	274 MB (LoRA r=16)

Data Pipeline

3,928 Omni-MATH problems with multi-turn teacher-student rollouts
7,162 training turns with group size G=4 completions per prompt
Teacher guardrails: string-match + LLM judge to prevent answer leakage
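A two-stage leakage check of the kind named in the last item could look like the sketch below. The normalization rules and the `judge` callable are assumptions made for illustration; the repository's actual guardrail logic may differ.

```python
import re
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase, unwrap \\boxed{...}, and drop whitespace and dollar signs."""
    text = text.lower()
    text = re.sub(r"\\boxed\{(.*?)\}", r"\1", text)
    return re.sub(r"[\s$]", "", text)


def leaks_answer(
    hint: str,
    answer: str,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Return True if the hint appears to reveal the ground-truth answer."""
    norm_answer = normalize(answer)
    # Stage 1: cheap string match on normalized text.
    if norm_answer and norm_answer in normalize(hint):
        return True
    # Stage 2: optional LLM judge (hypothetical callable) for paraphrased leaks.
    if judge is not None:
        return judge(hint, answer)
    return False
```

A plain substring match is deliberately conservative: a short answer such as "4" will flag any hint containing that digit, so a rejected hint would typically be regenerated rather than treated as a confirmed leak.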

Evaluation Suite

4 benchmarks: Omni-MATH, Coding, Logic Puzzles, Maze Navigation
Multi-turn evaluation: Up to 5 teacher-student turns per problem
Fast batched inference via vLLM with LoRA hot-swapping (100x faster than HF generate)
Metrics: Cumulative accuracy per turn, multi-turn gain, plasticity, correction success rate
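As a rough illustration of the first two metrics, the sketch below assumes that cumulative accuracy at turn k is the fraction of problems solved at or before turn k, and that multi-turn gain is the difference between the last and first points of that curve. Both definitions are assumptions for this example; `src/evaluation/metrics.py` is the authoritative version.

```python
from typing import List, Optional


def cumulative_accuracy(
    first_correct_turn: List[Optional[int]], max_turns: int = 5
) -> List[float]:
    """Fraction of problems solved at or before each turn.

    first_correct_turn[i] is the 1-indexed turn at which problem i was first
    solved, or None if it was never solved.
    """
    n = len(first_correct_turn)
    if n == 0:
        return [0.0] * max_turns
    return [
        sum(1 for t in first_correct_turn if t is not None and t <= k) / n
        for k in range(1, max_turns + 1)
    ]


def multi_turn_gain(curve: List[float]) -> float:
    """Accuracy gained between the first turn and the final turn."""
    return curve[-1] - curve[0]
```

For example, with four problems first solved at turns 1, 3, never, and 2, the curve is 0.25, 0.5, 0.75, 0.75, 0.75 and the multi-turn gain is 0.5.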

Training Results

SFT Baseline — Converged Successfully

The SFT baseline trained on teacher-student dialogue data converged smoothly over 3 epochs, achieving a final loss of 0.68 with 81.4% token accuracy.

SFT Configuration:

LoRA rank: 64, alpha: 128
Learning rate: 2e-5 (cosine decay)
3 epochs, batch size 4, gradient accumulation 8
Final adapter size: 1.04 GB (834 tensors)

GRPO Training (RLVR & RL2F)

Both RL methods used GRPO with group size G=4, KL penalty β=0.1, and clip range 0.2.
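To make these hyperparameters concrete, here is a minimal pure-Python sketch of the two core GRPO pieces: group-relative advantages and the PPO-style clipped term. It assumes population standard deviation for normalization (implementations vary) and omits the β-weighted KL penalty; it is not the code in `src/training/grpo.py`.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of G completions for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO-style clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With G=4 and rewards `[1, 0, 0, 0]`, the single correct completion gets a positive advantage and the other three get equal negative ones. With rewards `[0, 0, 0, 0]`, every advantage is exactly zero, so that group contributes nothing to the policy-gradient term.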

[Figure: RL2F (GRPO) training curve]

[Figure: RLVR baseline training curve]

Key observations from GRPO training:

Both RLVR and RL2F losses diverged to large negative values (−44K and −67K respectively)
SFT converged stably to 0.68
The divergence pattern is consistent across both RL methods, pointing to a systematic issue with reward sparsity on competition-level math

Reward Signal Analysis

A critical finding: the reward distribution in our rollout data was extremely sparse, with 74.5% of all rewards being zero and only 1.9% of turns containing a correct answer.

This is the core challenge: Omni-MATH is competition-level mathematics that Gemma 3 12B cannot reliably solve even in its base form. With near-zero positive reward signal, GRPO has no meaningful gradient to learn from.

Evaluation Results

We evaluated all 4 model variants (base, SFT, RLVR, RL2F) across 4 benchmarks with batched vLLM inference; the generated plots are in figures/.

Plasticity Analysis

One of the paper's key metrics is plasticity — whether the model actually changes its answer after receiving teacher feedback. Higher plasticity indicates the model is genuinely processing the feedback rather than ignoring it.

Findings:

Base model shows the highest plasticity (67-93%), meaning it naturally adjusts responses to feedback
SFT has reduced plasticity (3-29%), as supervised training makes outputs more deterministic
RL-trained models show collapsed plasticity, indicating the GRPO optimization pushed them toward fixed output patterns
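One simple way to compute a plasticity score is the fraction of feedback turns on which the student's extracted answer differs from its previous answer. The sketch below uses that definition as an assumption; the paper and `src/evaluation/metrics.py` may define the metric differently (for example, by counting only turns that follow an incorrect answer).

```python
from typing import List


def plasticity(dialogues: List[List[str]]) -> float:
    """Fraction of consecutive answer pairs in which the answer changed.

    dialogues[i] is the ordered list of extracted answers for problem i,
    one per student turn.
    """
    changed = 0
    total = 0
    for answers in dialogues:
        for prev, curr in zip(answers, answers[1:]):
            total += 1
            if curr.strip() != prev.strip():
                changed += 1
    return changed / total if total else 0.0
```

Note that a high score only says the model changes its answer, not that the change is an improvement, which is why it is reported alongside correction success rate.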

Project Timeline

Phase	Date	Duration	Details
Rollout Generation	Mar 31–Apr 1	~8 hours	3,928 problems × 4 completions × up to 5 turns
SFT Training	Apr 1	~6 hours	3 epochs on dialogue data
RLVR Training	Apr 2	~12 hours	3 epochs single-turn GRPO
RL2F Training	Apr 3–4	~46 hours	3 epochs multi-turn GRPO
Evaluation	Apr 5	~1 hour	4 models × 4 benchmarks via vLLM

Total compute: ~65 GPU-hours on 2× A100 80GB (~$130 on RunPod)

Key Lessons Learned

1. Reward Sparsity is the Bottleneck

The paper uses Gemma 3 12B on Omni-MATH, which is competition-level math. Our base model achieves a 0% solve rate on these problems, so GRPO receives essentially no positive reward signal. This leads to policy collapse: the model learns to avoid generating coherent reasoning rather than improving at math.

Takeaway: For RL on reasoning tasks, the base model must be able to solve at least 10-20% of problems to provide meaningful reward gradients.
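A back-of-the-envelope calculation shows why the solve rate matters so much for GRPO. Under the simplifying assumptions that rewards are binary and that the G completions in a group succeed independently with probability p, a group carries learning signal only when it is neither all-correct nor all-wrong. The function name below is hypothetical.

```python
def informative_group_prob(p: float, group_size: int = 4) -> float:
    """Probability that a group has mixed outcomes (not all-0 and not all-1).

    Assumes independent binary rewards with per-sample success probability p.
    """
    return 1.0 - p ** group_size - (1.0 - p) ** group_size
```

With G=4, a solve rate of 2% gives mixed groups only about 8% of the time, while a solve rate of 15% raises that to about 48%. The remaining groups have identical rewards and therefore zero group-relative advantage.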

2. KL Penalty Dynamics Matter

With β=0.1 and sparse rewards, the KL penalty term appears to dominate the loss function. When most advantages are zero, the policy-gradient term contributes almost nothing, so the update is driven by the KL term rather than by any signal toward better solutions. In our runs this coincided with the model drifting away from coherent math reasoning.

3. SFT Provides a Strong Foundation

Despite being the simplest method, SFT on teacher-student dialogues produced a stable, well-behaved model. It learned the structured output format and could process feedback, even if plasticity was lower than the base model.

4. vLLM is Essential for Multi-Turn Evaluation

Our initial HuggingFace-based evaluation was estimated to take ~8 days for the full benchmark suite. Switching to vLLM with batched generation and LoRA hot-swapping brought this down to ~1 hour (a speedup of more than 100x).
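The pattern behind this speedup can be sketched with vLLM's offline API: load the base model once with LoRA enabled, then pass a different `LoRARequest` per batch. The function name, adapter paths, and sampling settings here are assumptions for illustration and are not taken from `scripts/evaluate_vllm.py`.

```python
from typing import Dict, List, Optional


def evaluate_adapters(
    model_name: str,
    prompts: List[str],
    adapters: Dict[str, Optional[str]],
    max_tokens: int = 1024,
    tensor_parallel_size: int = 2,
) -> Dict[str, List[str]]:
    """Generate completions for every adapter using one loaded base model.

    `adapters` maps a variant name to a LoRA directory, or to None for the
    base model without any adapter.
    """
    # Imported lazily so this module can be imported without vLLM installed.
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(
        model=model_name,
        enable_lora=True,
        max_lora_rank=64,  # must cover the largest adapter (SFT uses r=64)
        tensor_parallel_size=tensor_parallel_size,
    )
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)

    results: Dict[str, List[str]] = {}
    for idx, (name, path) in enumerate(adapters.items(), start=1):
        # Swapping adapters is a per-request argument; the base weights stay loaded.
        lora = LoRARequest(name, idx, path) if path else None
        outputs = llm.generate(prompts, params, lora_request=lora)
        results[name] = [out.outputs[0].text for out in outputs]
    return results
```

A call might look like `evaluate_adapters("google/gemma-3-12b-it", prompts, {"base": None, "sft": "path/to/sft_adapter"})`, where the adapter path is a placeholder. Because each `generate` call receives the whole prompt list, vLLM can batch across prompts instead of decoding them one at a time.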

5. What We Would Change

For a successful replication:

Use easier benchmarks (GSM8K, MATH) where Gemma 3 12B can solve 15-30% of problems
Filter training rollouts to only include groups with meaningful reward variance
Lower KL penalty from β=0.1 to β=0.01 for sparse reward settings
Add reward shaping with intermediate progress rewards, not just binary correctness
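The second change in this list, filtering on reward variance, is simple to sketch. The example assumes each rollout group is a dict with a `"rewards"` list holding one reward per completion; that layout and the threshold are hypothetical, not the format of the repository's rollout file.

```python
from typing import Dict, List


def filter_informative_groups(
    groups: List[Dict], min_std: float = 1e-6
) -> List[Dict]:
    """Keep only groups whose rewards differ enough to yield nonzero advantages."""
    kept = []
    for group in groups:
        rewards = group["rewards"]
        if not rewards:
            continue  # nothing to learn from an empty group
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
        if std > min_std:
            kept.append(group)
    return kept
```

Groups where every completion earns the same reward, whether all wrong or all right, are dropped because their group-relative advantages are all zero and they would only contribute the KL term to the update.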

Repository Structure

RL2F/
├── configs/
│   └── default.yaml              # Full training configuration
├── src/
│   ├── data/
│   │   └── omni_math.py          # Omni-MATH data loading & answer checking
│   ├── evaluation/
│   │   ├── benchmarks.py         # Benchmark loaders (coding, puzzles, mazes)
│   │   ├── evaluator.py          # Multi-turn evaluation framework
│   │   └── metrics.py            # Cumulative accuracy, plasticity, CSR
│   ├── models/
│   │   ├── student.py            # Student model (problem-solving)
│   │   └── teacher.py            # Teacher model (hint generation)
│   ├── training/
│   │   ├── grpo.py               # GRPO implementation
│   │   ├── rewards.py            # Reward computation (correctness + format)
│   │   ├── rlvr_baseline.py      # Single-turn RL baseline
│   │   ├── self_play.py          # Multi-turn self-play rollout generation
│   │   └── sft_baseline.py       # Supervised fine-tuning baseline
│   └── utils/
│       └── prompts.py            # All prompt templates
├── scripts/
│   ├── train_all_vllm.py         # Full training pipeline (all 3 methods)
│   ├── train_rl2f_only.py        # RL2F training only
│   ├── evaluate.py               # HuggingFace-based evaluation
│   ├── evaluate_vllm.py          # Fast vLLM-based batched evaluation
│   ├── generate_dialogues.py     # Rollout generation
│   └── visualize_results.py      # Results visualization
├── figures/                      # All generated plots
├── generate_figures.py           # Figure generation script
└── requirements.txt

Setup & Reproduction

Requirements

pip install -r requirements.txt

Hardware: 2× NVIDIA A100 80GB (or equivalent with ≥40GB VRAM per GPU)

Step 1: Generate Rollouts

python scripts/generate_dialogues.py \
    --config configs/default.yaml \
    --output outputs/rollouts.json

Step 2: Train All Methods

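# Note: scripts/train_baselines.py is not listed under Repository Structure;
# scripts/train_all_vllm.py is listed there as the full pipeline for all 3 methods.

# SFT Baseline
python scripts/train_baselines.py --method sft --config configs/default.yaml

# RLVR Baseline
python scripts/train_baselines.py --method rlvr --config configs/default.yaml

# RL2F (paper's method)
python scripts/train_rl2f_only.py --config configs/default.yaml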

Step 3: Evaluate

# Fast evaluation with vLLM (recommended)
python scripts/evaluate_vllm.py \
    --model google/gemma-3-12b-it \
    --benchmarks omni_math,coding,puzzle,maze \
    --tensor-parallel-size 2

# Generate figures
python generate_figures.py

Configuration

All hyperparameters are in configs/default.yaml:

Parameter	Value	Description
Model	Gemma 3 12B IT	Base model for both teacher and student
LoRA r (SFT)	64	Higher rank for supervised learning
LoRA r (RL)	16	Lower rank for RL fine-tuning
GRPO group size	4	Completions per prompt
GRPO β (KL)	0.1	KL divergence penalty
GRPO clip range	0.2	PPO-style clipping
Learning rate	1e-5	With cosine decay (the SFT configuration above lists 2e-5)
Batch size	4	Per-GPU
Grad accumulation	8	Effective batch = 32
Max turns	5	Teacher-student exchanges
Training epochs	3	For all methods

Citation

@article{klissarov2026improving,
  title={Improving Interactive In-Context Learning from Natural Language Feedback},
  author={Klissarov, Martin and Mazoure, Bogdan and Piche, Alexandre and
          Li, Liam and Bacon, Pierre-Luc and Precup, Doina},
  journal={arXiv preprint arXiv:2602.16066},
  year={2026}
}

Acknowledgments

Paper authors for the RL2F framework and experimental design
Google for the Gemma 3 12B model
RunPod for cloud GPU infrastructure
vLLM for high-throughput inference with LoRA support
Built as part of the Vizuara Modern Robot Learning Bootcamp
