Paper: Improving Interactive In-Context Learning from Natural Language Feedback (Klissarov et al., 2026)
Replication by: Rajat Dandekar · Vizuara
Infrastructure: 2× NVIDIA A100 80GB (RunPod) · Gemma 3 12B · ~65 GPU-hours total
This repository contains a full end-to-end replication of the RL2F paper, which proposes training LLMs to learn from corrective natural language feedback through information asymmetry between a teacher and student model.
The key idea: a teacher model sees both the problem and the ground-truth solution, while a student model only sees the problem. The teacher provides hints (never the answer) to guide the student. The student is then trained via GRPO (Group Relative Policy Optimization) to improve its ability to incorporate feedback across multiple turns.
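The information asymmetry described above can be sketched as a simple rollout loop. This is a minimal illustration, not the repository's `self_play.py`: `run_dialogue`, `toy_student`, `toy_teacher`, and `exact_match` are hypothetical names, and the student and teacher are plain callables standing in for model calls.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def run_dialogue(
    problem: str,
    solution: str,
    student: Callable[[str, List[Message]], str],
    teacher: Callable[[str, str, List[Message]], str],
    is_correct: Callable[[str, str], bool],
    max_turns: int = 5,
) -> dict:
    """Run one multi-turn rollout. The student never receives `solution`."""
    history: List[Message] = []
    for turn in range(1, max_turns + 1):
        # Student view: the problem plus the dialogue so far.
        answer = student(problem, history)
        history.append({"role": "student", "content": answer})
        if is_correct(answer, solution):
            return {"solved": True, "turns": turn, "history": history}
        if turn < max_turns:
            # Teacher view: problem, ground-truth solution, and the dialogue.
            hint = teacher(problem, solution, history)
            history.append({"role": "teacher", "content": hint})
    return {"solved": False, "turns": max_turns, "history": history}

# Toy stand-ins for the two models (assumptions for illustration only).
def toy_student(problem: str, history: List[Message]) -> str:
    got_hint = any(m["role"] == "teacher" for m in history)
    return "4" if got_hint else "5"

def toy_teacher(problem: str, solution: str, history: List[Message]) -> str:
    return "Re-check your arithmetic."  # a hint, never the answer itself

def exact_match(answer: str, solution: str) -> bool:
    return answer.strip() == solution.strip()

result = run_dialogue("What is 2 + 2?", "4", toy_student, toy_teacher, exact_match)
```

Keeping `solution` out of the student's signature makes the asymmetry explicit in the code: only the teacher and the correctness check can read it.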
| Method | Description | Training Time | Checkpoint |
|---|---|---|---|
| SFT Baseline | Supervised fine-tuning on teacher-student dialogues | ~6 hours | 1.04 GB (LoRA r=64) |
| RLVR Baseline | Single-turn RL (no teacher feedback) | ~12 hours | 274 MB (LoRA r=16) |
| RL2F | Multi-turn RL with teacher feedback (paper's method) | ~46 hours | 274 MB (LoRA r=16) |
- 3,928 Omni-MATH problems with multi-turn teacher-student rollouts
- 7,162 training turns with group size G=4 completions per prompt
- Teacher guardrails: string-match + LLM judge to prevent answer leakage
- 4 benchmarks: Omni-MATH, Coding, Logic Puzzles, Maze Navigation
- Multi-turn evaluation: Up to 5 teacher-student turns per problem
- Fast batched inference via vLLM with LoRA hot-swapping (over 100x faster than HuggingFace `generate`)
- Metrics: Cumulative accuracy per turn, multi-turn gain, plasticity, correction success rate
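The string-match half of the teacher guardrail listed above might look like the sketch below. The function names and the normalization rules are assumptions, not the repository's code, and the LLM-judge half is not shown.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop whitespace, commas, and dollar signs,
    so that '1,024' and '$1024$' both compare equal to '1024'."""
    return re.sub(r"[\s,$]", "", text.lower())

def leaks_answer(hint: str, answer: str) -> bool:
    """Return True if the normalized answer appears verbatim in the hint."""
    target = normalize(answer)
    return bool(target) and target in normalize(hint)
```

A plain substring check like this will flag many harmless hints when the answer is very short (for example, a single digit), which is one reason to pair it with a second, judgment-based check.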
The SFT baseline trained on teacher-student dialogue data converged smoothly over 3 epochs, achieving a final loss of 0.68 with 81.4% token accuracy.
SFT Configuration:
- LoRA rank: 64, alpha: 128
- Learning rate: 2e-5 (cosine decay)
- 3 epochs, batch size 4, gradient accumulation 8
- Final adapter size: 1.04 GB (834 tensors)
Both RL methods used GRPO with group size G=4, KL penalty β=0.1, and clip range 0.2.
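To make the roles of β and the clip range concrete, here is a minimal sequence-level sketch of a GRPO-style loss for one group. It is an assumption about the general form, not a copy of `src/training/grpo.py`: in particular, the KL estimator used here is one common non-negative choice, and real implementations usually work per token on tensors.

```python
import math
from typing import Sequence

def grpo_loss(
    logp_new: Sequence[float],    # log-prob of each completion under the current policy
    logp_old: Sequence[float],    # log-prob under the policy that sampled the rollouts
    logp_ref: Sequence[float],    # log-prob under the frozen reference model
    advantages: Sequence[float],  # group-relative advantages, one per completion
    beta: float = 0.1,            # KL penalty weight
    clip: float = 0.2,            # PPO-style clip range
) -> float:
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        diff = lr - ln
        kl = math.exp(diff) - diff - 1.0  # non-negative estimate of KL(new || ref)
        total += surrogate - beta * kl
    return -total / len(advantages)
```

In this sketch, a group whose advantages are all zero contributes nothing through the surrogate term, so its loss reduces to the β-weighted KL estimate alone.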
RL2F (GRPO) Training Curve:
RLVR Baseline Training Curve:
Key observations from GRPO training:
- Both RLVR and RL2F losses diverged to large negative values (−44K and −67K respectively)
- SFT converged stably to 0.68
- The divergence pattern is consistent across both RL methods, pointing to a systematic issue with reward sparsity on competition-level math
A critical finding: the reward distribution in our rollout data was extremely sparse, with 74.5% of all rewards being zero and only 1.9% of turns containing a correct answer.
This is the core challenge: Omni-MATH is competition-level mathematics that Gemma 3 12B cannot reliably solve even in its base form. With near-zero positive reward signal, GRPO has no meaningful gradient to learn from.
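The effect of sparse rewards on GRPO can be seen directly in the group-relative advantage. The sketch below assumes the usual mean/standard-deviation normalization within a group of G=4 completions; `group_advantages` and the `eps` constant are illustrative choices.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])  # every completion failed
one_right = group_advantages([1.0, 0.0, 0.0, 0.0])  # a single success
```

When all four completions receive the same reward, every advantage is zero and the group carries no policy-gradient signal. Only groups with at least one differing reward, such as `one_right`, produce non-zero advantages.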
We evaluated all 4 model variants (base, SFT, RLVR, RL2F) across 4 benchmarks with batched vLLM inference:
One of the paper's key metrics is plasticity — whether the model actually changes its answer after receiving teacher feedback. Higher plasticity indicates the model is genuinely processing the feedback rather than ignoring it.
Findings:
- Base model shows the highest plasticity (67-93%), meaning it naturally adjusts responses to feedback
- SFT has reduced plasticity (3-29%), likely because supervised training makes outputs more deterministic
- RL-trained models show collapsed plasticity, indicating the GRPO optimization pushed them toward fixed output patterns
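For reference, plasticity and cumulative accuracy per turn could be computed along the following lines. The exact definitions in `src/evaluation/metrics.py` are not shown here, so both functions are assumptions: plasticity is taken as the fraction of post-feedback turns on which the answer differs from the previous one, and cumulative accuracy as the fraction of problems solved at or before each turn.

```python
from typing import List, Optional, Sequence

def plasticity(dialogues: Sequence[Sequence[str]]) -> float:
    """dialogues: for each problem, the student's answers in turn order."""
    changed = 0
    total = 0
    for answers in dialogues:
        for prev, cur in zip(answers, answers[1:]):
            total += 1
            if prev.strip() != cur.strip():
                changed += 1
    return changed / total if total else 0.0

def cumulative_accuracy(
    first_correct_turn: Sequence[Optional[int]], max_turns: int = 5
) -> List[float]:
    """first_correct_turn: 1-based turn of the first correct answer, or None."""
    n = len(first_correct_turn)
    if n == 0:
        return [0.0] * max_turns
    return [
        sum(1 for t in first_correct_turn if t is not None and t <= k) / n
        for k in range(1, max_turns + 1)
    ]
```

Under these definitions, high plasticity alone does not imply improvement: a model can change its answer on every turn and still never reach a correct one, which is why the two metrics are read together.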
| Phase | Date | Duration | Details |
|---|---|---|---|
| Rollout Generation | Mar 31–Apr 1 | ~8 hours | 3,928 problems × 4 completions × up to 5 turns |
| SFT Training | Apr 1 | ~6 hours | 3 epochs on dialogue data |
| RLVR Training | Apr 2 | ~12 hours | 3 epochs single-turn GRPO |
| RL2F Training | Apr 3–4 | ~46 hours | 3 epochs multi-turn GRPO |
| Evaluation | Apr 5 | ~1 hour | 4 models × 4 benchmarks via vLLM |
Total compute: 65 GPU-hours on 2× A100 80GB ($130 on RunPod)
The paper uses Gemma 3 12B on Omni-MATH, which is competition-level math. Our base model achieves 0% solve rate on these problems, meaning GRPO receives essentially no positive reward signal. This leads to policy collapse — the model learns to avoid generating coherent reasoning rather than improving at math.
Takeaway: For RL on reasoning tasks, the base model should already solve a meaningful fraction of problems (as a rule of thumb, at least 10-20%) so that rewards vary within each group and provide a usable gradient.
With β=0.1 and sparse rewards, the KL penalty term dominates the loss function. When most advantages are zero, the effective gradient pushes the model away from generating its training distribution (coherent math reasoning), rather than toward better solutions.
Despite being the simplest method, SFT on teacher-student dialogues produced a stable, well-behaved model. It learned the structured output format and could process feedback, even if plasticity was lower than the base model.
Our initial HuggingFace-based evaluation estimated ~8 days for the full benchmark suite. Switching to vLLM with batched generation and LoRA hot-swapping brought this down to ~1 hour, a speedup of more than 100x.
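The LoRA hot-swapping pattern can be sketched with vLLM's offline API as shown below. This is not the repository's `evaluate_vllm.py`: the adapter paths in `ADAPTERS`, the sampling settings, and the function name are hypothetical, and the snippet assumes a vLLM version that supports LoRA for this model.

```python
from typing import Dict, List, Optional

# Hypothetical adapter locations; None means "evaluate the base model".
ADAPTERS: Dict[str, Optional[str]] = {
    "base": None,
    "sft": "outputs/sft",
    "rlvr": "outputs/rlvr",
    "rl2f": "outputs/rl2f",
}

def evaluate_adapters(
    model_id: str,
    adapters: Dict[str, Optional[str]],
    prompts: List[str],
    tensor_parallel_size: int = 2,
    max_lora_rank: int = 64,
) -> Dict[str, List[str]]:
    """Load the base weights once, then generate with each adapter in turn."""
    # Imported lazily so this module can be imported without vLLM installed.
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(
        model=model_id,
        enable_lora=True,
        max_lora_rank=max_lora_rank,  # must cover the largest adapter (SFT uses r=64)
        tensor_parallel_size=tensor_parallel_size,
    )
    params = SamplingParams(temperature=0.0, max_tokens=1024)  # assumed settings

    results: Dict[str, List[str]] = {}
    for idx, (name, path) in enumerate(adapters.items(), start=1):
        request = LoRARequest(name, idx, path) if path else None
        outputs = llm.generate(prompts, params, lora_request=request)
        results[name] = [out.outputs[0].text for out in outputs]
    return results
```

The main saving comes from loading the 12B base weights a single time and passing each adapter per request, rather than reloading or merging the model for every variant.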
For a successful replication:
- Use easier benchmarks (GSM8K, MATH) where Gemma 3 12B can solve 15-30% of problems
- Filter training rollouts to only include groups with meaningful reward variance
- Lower KL penalty from β=0.1 to β=0.01 for sparse reward settings
- Add reward shaping with intermediate progress rewards, not just binary correctness
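The rollout-filtering recommendation above could be implemented roughly as follows. The group layout (a dict with `prompt` and `rewards` keys) and the function name are assumptions for illustration.

```python
from typing import Dict, List

def filter_informative_groups(groups: List[Dict], min_spread: float = 1e-8) -> List[Dict]:
    """Keep only groups whose rewards are not all identical.

    A group with identical rewards yields zero group-relative advantages,
    so it adds compute without adding any policy-gradient signal.
    """
    return [
        g for g in groups
        if max(g["rewards"]) - min(g["rewards"]) > min_spread
    ]

groups = [
    {"prompt": "p1", "rewards": [0.0, 0.0, 0.0, 0.0]},  # all wrong: dropped
    {"prompt": "p2", "rewards": [1.0, 0.0, 0.0, 0.0]},  # mixed: kept
    {"prompt": "p3", "rewards": [1.0, 1.0, 1.0, 1.0]},  # all right: dropped
]
kept = filter_informative_groups(groups)
```

Logging the fraction of groups that survive this filter is also a cheap early check on whether a benchmark is too hard (or too easy) for the base model before committing GPU time.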
```
RL2F/
├── configs/
│   └── default.yaml              # Full training configuration
├── src/
│   ├── data/
│   │   └── omni_math.py          # Omni-MATH data loading & answer checking
│   ├── evaluation/
│   │   ├── benchmarks.py         # Benchmark loaders (coding, puzzles, mazes)
│   │   ├── evaluator.py          # Multi-turn evaluation framework
│   │   └── metrics.py            # Cumulative accuracy, plasticity, CSR
│   ├── models/
│   │   ├── student.py            # Student model (problem-solving)
│   │   └── teacher.py            # Teacher model (hint generation)
│   ├── training/
│   │   ├── grpo.py               # GRPO implementation
│   │   ├── rewards.py            # Reward computation (correctness + format)
│   │   ├── rlvr_baseline.py      # Single-turn RL baseline
│   │   ├── self_play.py          # Multi-turn self-play rollout generation
│   │   └── sft_baseline.py       # Supervised fine-tuning baseline
│   └── utils/
│       └── prompts.py            # All prompt templates
├── scripts/
│   ├── train_all_vllm.py         # Full training pipeline (all 3 methods)
│   ├── train_rl2f_only.py        # RL2F training only
│   ├── evaluate.py               # HuggingFace-based evaluation
│   ├── evaluate_vllm.py          # Fast vLLM-based batched evaluation
│   ├── generate_dialogues.py     # Rollout generation
│   └── visualize_results.py      # Results visualization
├── figures/                      # All generated plots
├── generate_figures.py           # Figure generation script
└── requirements.txt
```
```bash
pip install -r requirements.txt
```

Hardware: 2× NVIDIA A100 80GB (or equivalent with ≥40GB VRAM per GPU)

```bash
python scripts/generate_dialogues.py \
    --config configs/default.yaml \
    --output outputs/rollouts.json
```

```bash
# SFT Baseline
python scripts/train_baselines.py --method sft --config configs/default.yaml

# RLVR Baseline
python scripts/train_baselines.py --method rlvr --config configs/default.yaml

# RL2F (paper's method)
python scripts/train_rl2f_only.py --config configs/default.yaml
```

```bash
# Fast evaluation with vLLM (recommended)
python scripts/evaluate_vllm.py \
    --model google/gemma-3-12b-it \
    --benchmarks omni_math,coding,puzzle,maze \
    --tensor-parallel-size 2

# Generate figures
python generate_figures.py
```

All hyperparameters are in `configs/default.yaml`:

| Parameter | Value | Description |
|---|---|---|
| Model | Gemma 3 12B IT | Base model for both teacher and student |
| LoRA r (SFT) | 64 | Higher rank for supervised learning |
| LoRA r (RL) | 16 | Lower rank for RL fine-tuning |
| GRPO group size | 4 | Completions per prompt |
| GRPO β (KL) | 0.1 | KL divergence penalty |
| GRPO clip range | 0.2 | PPO-style clipping |
| Learning rate | 1e-5 | With cosine decay |
| Batch size | 4 | Per-GPU |
| Grad accumulation | 8 | Effective batch = 32 |
| Max turns | 5 | Teacher-student exchanges |
| Training epochs | 3 | For all methods |
```bibtex
@article{klissarov2026improving,
  title={Improving Interactive In-Context Learning from Natural Language Feedback},
  author={Klissarov, Martin and Mazoure, Bogdan and Piche, Alexandre and
          Li, Liam and Bacon, Pierre-Luc and Precup, Doina},
  journal={arXiv preprint arXiv:2602.16066},
  year={2026}
}
```

- Paper authors for the RL2F framework and experimental design
- Google for the Gemma 3 12B model
- RunPod for cloud GPU infrastructure
- vLLM for high-throughput inference with LoRA support
- Built as part of the Vizuara Modern Robot Learning Bootcamp








