RajatDandekar/RL2F

RL2F: Replicating "Reinforcement Learning from Language Feedback"

Paper: Improving Interactive In-Context Learning from Natural Language Feedback (Klissarov et al., 2026)

Replication by: Rajat Dandekar · Vizuara

Infrastructure: 2× NVIDIA A100 80GB (RunPod) · Gemma 3 12B · ~65 GPU-hours total


Overview

This repository contains a full end-to-end replication of the RL2F paper, which proposes training LLMs to learn from corrective natural language feedback through information asymmetry between a teacher and student model.

The key idea: a teacher model sees both the problem and the ground-truth solution, while a student model only sees the problem. The teacher provides hints (never the answer) to guide the student. The student is then trained via GRPO (Group Relative Policy Optimization) to improve its ability to incorporate feedback across multiple turns.
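The teacher-student loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `run_dialogue`, `Dialogue`, and the toy `student`/`teacher` callables are hypothetical names, and a real setup would replace the callables with LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (student answer, teacher hint)


@dataclass
class Dialogue:
    problem: str
    turns: List[Turn] = field(default_factory=list)
    solved: bool = False


def run_dialogue(
    problem: str,
    solution: str,
    student: Callable[[str, List[Turn]], str],
    teacher: Callable[[str, str, str], str],
    is_correct: Callable[[str, str], bool],
    max_turns: int = 5,
) -> Dialogue:
    """Run up to max_turns student attempts with teacher hints in between."""
    dialogue = Dialogue(problem)
    for _ in range(max_turns):
        # The student sees only the problem and the dialogue so far.
        answer = student(problem, dialogue.turns)
        if is_correct(answer, solution):
            dialogue.turns.append((answer, ""))
            dialogue.solved = True
            break
        # The teacher additionally sees the ground-truth solution.
        hint = teacher(problem, solution, answer)
        dialogue.turns.append((answer, hint))
    return dialogue


# Toy stand-ins for the two models (assumptions for illustration only).
def toy_student(problem: str, turns: List[Turn]) -> str:
    if not turns:
        return "1"
    last_answer, hint = turns[-1]
    return str(int(last_answer) + (1 if hint == "higher" else -1))


def toy_teacher(problem: str, solution: str, answer: str) -> str:
    # Gives a direction, never the answer itself.
    return "higher" if int(answer) < int(solution) else "lower"


result = run_dialogue("x + 1 = 4", "3", toy_student, toy_teacher,
                      lambda a, s: a == s)
```

The information asymmetry lives entirely in the call signatures: `solution` is passed to `teacher` but never to `student`.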

[Figure: Architecture]

What We Built

Complete Training Pipeline (3 methods)

| Method | Description | Training Time | Checkpoint |
|---|---|---|---|
| SFT Baseline | Supervised fine-tuning on teacher-student dialogues | ~6 hours | 1.04 GB (LoRA r=64) |
| RLVR Baseline | Single-turn RL (no teacher feedback) | ~12 hours | 274 MB (LoRA r=16) |
| RL2F | Multi-turn RL with teacher feedback (paper's method) | ~46 hours | 274 MB (LoRA r=16) |

Data Pipeline

  • 3,928 Omni-MATH problems with multi-turn teacher-student rollouts
  • 7,162 training turns with group size G=4 completions per prompt
  • Teacher guardrails: string-match + LLM judge to prevent answer leakage
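The string-match half of the teacher guardrail could look roughly like the sketch below. The function names and the normalization rules are assumptions for illustration; the repository's actual check may differ.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, unwrap LaTeX boxed{...}, and drop whitespace and dollar signs."""
    text = text.lower()
    text = re.sub(r"\\boxed\{([^{}]*)\}", r"\1", text)
    return re.sub(r"[\s$]", "", text)


def leaks_answer(hint: str, answer: str) -> bool:
    """Return True if the normalized answer appears verbatim in the hint."""
    normalized_answer = normalize(answer)
    return bool(normalized_answer) and normalized_answer in normalize(hint)


flagged = leaks_answer("The answer is $\\boxed{42}$.", "42")
clean = leaks_answer("Try factoring the quadratic first.", "42")
```

A plain substring test is deliberately conservative: a short answer such as `2` will match many harmless hints, and a paraphrased answer will slip through. That is one reason to pair it with a second check such as the LLM judge mentioned above.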

Evaluation Suite

  • 4 benchmarks: Omni-MATH, Coding, Logic Puzzles, Maze Navigation
  • Multi-turn evaluation: Up to 5 teacher-student turns per problem
  • Fast batched inference via vLLM with LoRA hot-swapping (over 100x faster than Hugging Face generate)
  • Metrics: Cumulative accuracy per turn, multi-turn gain, plasticity, correction success rate
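As a rough sketch of the first two metrics, assume each problem is summarized by the 1-based turn at which it was first solved (or `None` if never solved). Cumulative accuracy at turn k is then the fraction solved at or before turn k, and multi-turn gain is taken here as the difference between the last and first turn. These definitions and function names are assumptions; the source does not spell them out.

```python
from typing import List, Optional


def cumulative_accuracy(first_solved: List[Optional[int]],
                        max_turns: int = 5) -> List[float]:
    """Fraction of problems solved at or before each turn 1..max_turns."""
    if not first_solved:
        raise ValueError("need at least one problem")
    n = len(first_solved)
    return [
        sum(1 for t in first_solved if t is not None and t <= k) / n
        for k in range(1, max_turns + 1)
    ]


def multi_turn_gain(curve: List[float]) -> float:
    """Accuracy gained between the first and the final turn."""
    return curve[-1] - curve[0]


# Eight hypothetical problems: two solved on turn 1, one each on turns 2 and 3.
first_solved = [1, None, 3, 2, None, 1, None, None]
curve = cumulative_accuracy(first_solved)
print(curve)                   # [0.25, 0.375, 0.5, 0.5, 0.5]
print(multi_turn_gain(curve))  # 0.25
```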

Training Results

SFT Baseline — Converged Successfully

The SFT baseline trained on teacher-student dialogue data converged smoothly over 3 epochs, achieving a final loss of 0.68 with 81.4% token accuracy.

[Figure: SFT Training]

SFT Configuration:

  • LoRA rank: 64, alpha: 128
  • Learning rate: 2e-5 (cosine decay)
  • 3 epochs, batch size 4, gradient accumulation 8
  • Final adapter size: 1.04 GB (834 tensors)

GRPO Training (RLVR & RL2F)

Both RL methods used GRPO with group size G=4, KL penalty β=0.1, and clip range 0.2.
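For readers unfamiliar with GRPO, the sketch below shows the two pieces these hyperparameters control: advantages normalized within a group of G completions, and a PPO-style clipped surrogate with a KL penalty. It is a scalar, pure-Python illustration under stated assumptions (population standard deviation, a precomputed per-token KL estimate) and is not taken from the repository's `grpo.py`.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new: float, logp_old: float, advantage: float,
                 clip: float = 0.2) -> float:
    """PPO-style clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


def token_objective(logp_new: float, logp_old: float, advantage: float,
                    kl: float, beta: float = 0.1, clip: float = 0.2) -> float:
    """Objective to maximize: clipped surrogate minus beta times the KL estimate."""
    return clipped_term(logp_new, logp_old, advantage, clip) - beta * kl


mixed = group_advantages([1.0, 0.0, 0.0, 0.0])    # one correct completion
uniform = group_advantages([0.0, 0.0, 0.0, 0.0])  # no correct completion
```

With G=4, a group where every completion gets the same reward yields all-zero advantages, so only the KL term remains in the objective for that group.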

[Figure: Training Comparison]

RL2F (GRPO) Training Curve:

[Figure: GRPO Training Loss]

RLVR Baseline Training Curve:

[Figure: RLVR Training Loss]

Key observations from GRPO training:

  • Both RLVR and RL2F losses diverged to large negative values (−44K and −67K respectively)
  • SFT converged stably to 0.68
  • The divergence pattern is consistent across both RL methods, pointing to a systematic issue with reward sparsity on competition-level math

Reward Signal Analysis

A critical finding: the reward distribution in our rollout data was extremely sparse, with 74.5% of all rewards being zero and only 1.9% of turns containing a correct answer.

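[Figure: Reward Distribution]

This is the core challenge: Omni-MATH is competition-level mathematics that the base Gemma 3 12B cannot reliably solve. GRPO computes each advantage relative to the other completions in the same group, so when every completion in a group earns the same reward, all advantages are zero and that group contributes no policy gradient. With near-zero positive reward signal, GRPO has no meaningful gradient to learn from.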


Evaluation Results

We evaluated all 4 model variants (base, SFT, RLVR, RL2F) across 4 benchmarks with batched vLLM inference:

[Figure: Evaluation Heatmap]

Plasticity Analysis

One of the paper's key metrics is plasticity — whether the model actually changes its answer after receiving teacher feedback. Higher plasticity indicates the model is genuinely processing the feedback rather than ignoring it.
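One simple way to operationalize this is to count how often consecutive answers differ across all feedback turns. The sketch below assumes exact string comparison after trimming whitespace; the repository's `metrics.py` may compare answers differently (for example after mathematical normalization).

```python
from typing import List


def plasticity(dialogues: List[List[str]]) -> float:
    """Fraction of feedback turns after which the student's answer changed.

    Each inner list holds one dialogue's answers in turn order.
    """
    changed = 0
    total = 0
    for answers in dialogues:
        for previous, current in zip(answers, answers[1:]):
            total += 1
            if previous.strip() != current.strip():
                changed += 1
    return changed / total if total else 0.0


# Three hypothetical dialogues; the last has a single turn and no feedback.
example = [["4", "5", "5"], ["x=2", "x=2"], ["7"]]
score = plasticity(example)  # one change out of three feedback turns
```

Note that plasticity alone does not say whether the change was an improvement, which is why it is reported alongside accuracy-based metrics.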

[Figure: Plasticity Comparison]

Findings:

  • Base model shows the highest plasticity (67-93%), meaning it naturally adjusts responses to feedback
  • SFT has reduced plasticity (3-29%), as supervised training makes outputs more deterministic
  • RL-trained models show collapsed plasticity, indicating the GRPO optimization pushed them toward fixed output patterns

Project Timeline

[Figure: Timeline]

| Phase | Date | Duration | Details |
|---|---|---|---|
| Rollout Generation | Mar 31–Apr 1 | ~8 hours | 3,928 problems × 4 completions × up to 5 turns |
| SFT Training | Apr 1 | ~6 hours | 3 epochs on dialogue data |
| RLVR Training | Apr 2 | ~12 hours | 3 epochs single-turn GRPO |
| RL2F Training | Apr 3–4 | ~46 hours | 3 epochs multi-turn GRPO |
| Evaluation | Apr 5 | ~1 hour | 4 models × 4 benchmarks via vLLM |

Total compute: ~65 GPU-hours reported on 2× A100 80GB ($130 on RunPod). Note that the phase durations listed above sum to ~73 hours.


Key Lessons Learned

1. Reward Sparsity is the Bottleneck

The paper uses Gemma 3 12B on Omni-MATH, which is competition-level math. Our base model achieves 0% solve rate on these problems, meaning GRPO receives essentially no positive reward signal. This leads to policy collapse — the model learns to avoid generating coherent reasoning rather than improving at math.

Takeaway: For RL on reasoning tasks, the base model must be able to solve at least 10-20% of problems to provide meaningful reward gradients.
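A back-of-the-envelope calculation shows why the solve rate matters so much at G=4. Assuming binary rewards and independent completions that each succeed with probability p (a simplification, since real problems vary in difficulty), a group carries a learning signal only if it contains both a success and a failure:

```python
def p_mixed_group(p: float, group_size: int = 4) -> float:
    """Probability that a group has at least one success and one failure."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


for solve_rate in (0.019, 0.10, 0.20):
    share = p_mixed_group(solve_rate)
    print(f"p={solve_rate:.3f}: {share:.1%} of groups are informative")
```

Under these assumptions, using the 1.9% correct-turn rate as p leaves only about 7% of groups informative, while p=0.10 and p=0.20 give roughly 34% and 59%.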

2. KL Penalty Dynamics Matter

With β=0.1 and sparse rewards, the KL penalty term dominates the loss function. When most advantages are zero, the effective gradient pushes the model away from generating its training distribution (coherent math reasoning), rather than toward better solutions.

3. SFT Provides a Strong Foundation

Despite being the simplest method, SFT on teacher-student dialogues produced a stable, well-behaved model. It learned the structured output format and could process feedback, even if plasticity was lower than the base model.

4. vLLM is Essential for Multi-Turn Evaluation

Our initial Hugging Face-based evaluation was estimated to take ~8 days for the full benchmark suite. Switching to vLLM with batched generation and LoRA hot-swapping brought this down to ~1 hour, a speedup of well over 100x (8 days is about 190 hours).

5. What We Would Change

For a successful replication:

  • Use easier benchmarks (GSM8K, MATH) where Gemma 3 12B solves a substantial share of problems, so that groups contain a mix of correct and incorrect completions
  • Filter training rollouts to only include groups with meaningful reward variance
  • Lower KL penalty from β=0.1 to β=0.01 for sparse reward settings
  • Add reward shaping with intermediate progress rewards, not just binary correctness
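The second item, filtering rollouts by reward variance, is straightforward to sketch. The group schema below (a dict with `prompt` and `rewards` keys) and the function name are hypothetical and not the repository's rollout format.

```python
from statistics import pstdev
from typing import Dict, List


def filter_informative_groups(groups: List[Dict],
                              min_std: float = 1e-6) -> List[Dict]:
    """Keep only groups whose rewards are not all (nearly) identical."""
    return [g for g in groups if pstdev(g["rewards"]) > min_std]


rollouts = [
    {"prompt": "a", "rewards": [0.0, 0.0, 0.0, 0.0]},  # all wrong: no signal
    {"prompt": "b", "rewards": [1.0, 0.0, 0.0, 0.0]},  # mixed: informative
    {"prompt": "c", "rewards": [0.2, 0.2, 0.2, 0.2]},  # identical partial credit
]
kept = filter_informative_groups(rollouts)
kept_prompts = [g["prompt"] for g in kept]
```

Dropping zero-variance groups does not change the policy-gradient term (their advantages are zero anyway), but it stops those groups from contributing only a KL penalty to each batch.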

Repository Structure

RL2F/
├── configs/
│   └── default.yaml              # Full training configuration
├── src/
│   ├── data/
│   │   └── omni_math.py          # Omni-MATH data loading & answer checking
│   ├── evaluation/
│   │   ├── benchmarks.py         # Benchmark loaders (coding, puzzles, mazes)
│   │   ├── evaluator.py          # Multi-turn evaluation framework
│   │   └── metrics.py            # Cumulative accuracy, plasticity, CSR
│   ├── models/
│   │   ├── student.py            # Student model (problem-solving)
│   │   └── teacher.py            # Teacher model (hint generation)
│   ├── training/
│   │   ├── grpo.py               # GRPO implementation
│   │   ├── rewards.py            # Reward computation (correctness + format)
│   │   ├── rlvr_baseline.py      # Single-turn RL baseline
│   │   ├── self_play.py          # Multi-turn self-play rollout generation
│   │   └── sft_baseline.py       # Supervised fine-tuning baseline
│   └── utils/
│       └── prompts.py            # All prompt templates
├── scripts/
│   ├── train_all_vllm.py         # Full training pipeline (all 3 methods)
│   ├── train_rl2f_only.py        # RL2F training only
│   ├── evaluate.py               # HuggingFace-based evaluation
│   ├── evaluate_vllm.py          # Fast vLLM-based batched evaluation
│   ├── generate_dialogues.py     # Rollout generation
│   └── visualize_results.py      # Results visualization
├── figures/                      # All generated plots
├── generate_figures.py           # Figure generation script
└── requirements.txt

Setup & Reproduction

Requirements

pip install -r requirements.txt

Hardware: 2× NVIDIA A100 80GB (or equivalent with ≥40GB VRAM per GPU)

Step 1: Generate Rollouts

python scripts/generate_dialogues.py \
    --config configs/default.yaml \
    --output outputs/rollouts.json

Step 2: Train All Methods

# SFT Baseline
python scripts/train_baselines.py --method sft --config configs/default.yaml

# RLVR Baseline
python scripts/train_baselines.py --method rlvr --config configs/default.yaml

# RL2F (paper's method)
python scripts/train_rl2f_only.py --config configs/default.yaml

Step 3: Evaluate

# Fast evaluation with vLLM (recommended)
python scripts/evaluate_vllm.py \
    --model google/gemma-3-12b-it \
    --benchmarks omni_math,coding,puzzle,maze \
    --tensor-parallel-size 2

# Generate figures
python generate_figures.py

Configuration

All hyperparameters are in configs/default.yaml:

| Parameter | Value | Description |
|---|---|---|
| Model | Gemma 3 12B IT | Base model for both teacher and student |
| LoRA r (SFT) | 64 | Higher rank for supervised learning |
| LoRA r (RL) | 16 | Lower rank for RL fine-tuning |
| GRPO group size | 4 | Completions per prompt |
| GRPO β (KL) | 0.1 | KL divergence penalty |
| GRPO clip range | 0.2 | PPO-style clipping |
| Learning rate | 1e-5 | With cosine decay |
| Batch size | 4 | Per-GPU |
| Grad accumulation | 8 | Effective batch = 32 |
| Max turns | 5 | Teacher-student exchanges |
| Training epochs | 3 | For all methods |

Citation

@article{klissarov2026improving,
  title={Improving Interactive In-Context Learning from Natural Language Feedback},
  author={Klissarov, Martin and Mazoure, Bogdan and Piche, Alexandre and
          Li, Liam and Bacon, Pierre-Luc and Precup, Doina},
  journal={arXiv preprint arXiv:2602.16066},
  year={2026}
}

Acknowledgments

  • Paper authors for the RL2F framework and experimental design
  • Google for the Gemma 3 12B model
  • RunPod for cloud GPU infrastructure
  • vLLM for high-throughput inference with LoRA support
  • Built as part of the Vizuara Modern Robot Learning Bootcamp
