Group Relative Policy Optimization (GRPO) applied to math reasoning and structured output verification — exploring RL post-training without a value network.
Context: GRPO eliminates the value network, which cuts VRAM by about 43% relative to PPO (7.2 GB down to 4.1 GB in the results table) and makes RL post-training feasible under enterprise GPU memory constraints. Verifiable rewards (code execution, schema validation) provide the training signal.
flowchart LR
P[Problem] --> G[Generate N Solutions]
G --> V[Verifier]
V --> |Correct| RP[Reward = 1]
V --> |Wrong| RN[Reward = 0]
RP --> GRPO[GRPO Update]
RN --> GRPO
GRPO --> P
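The verifier step in the flowchart can be sketched as a binary reward function. This is a minimal sketch under stated assumptions, not this repo's verifier: the names `extract_final_answer` and `binary_reward` are hypothetical, and treating the last number in a completion as its final answer is an assumed convention for GSM8K-style problems.

```python
import re
from typing import Optional

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number in the completion (assumed convention), or None."""
    # Drop thousands separators so "1,234" is read as "1234".
    matches = _NUMBER.findall(completion.replace(",", ""))
    return matches[-1] if matches else None


def binary_reward(completion: str, reference: str) -> float:
    """Reward 1.0 if the extracted answer equals the numeric reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    # Assumes `reference` is itself a plain numeric string.
    return 1.0 if float(answer) == float(reference) else 0.0
```

Each of the N sampled solutions for a problem would be scored this way before the GRPO update. A schema-validation or code-execution verifier would replace the numeric comparison while keeping the same 0/1 output.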
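No value network is needed: each solution's advantage is computed relative to the other solutions sampled for the same problem, using the group's mean reward as the baseline.
Dropping the value network is where the memory saving comes from: roughly 40 to 43% less VRAM than PPO (4.1 to 4.3 GB vs 7.2 GB in the results table).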
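The group-relative advantage can be sketched in a few lines. This is an illustrative sketch rather than this repo's implementation: the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions (implementations differ on these details).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of samples for the same problem.

    The group mean plays the role of the baseline that a value network
    would otherwise have to predict.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions, two judged correct by the verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With two correct and two wrong solutions, the correct ones receive advantages close to +1 and the wrong ones close to -1. When every solution in a group gets the same reward, all advantages are zero and that group contributes no gradient signal.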
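RL training can be viewed as a dynamical system on policy space, where each update maps the current policy to the next one.
GRPO's group-relative normalization rescales advantages to a consistent magnitude within each group, so it acts like an adaptive step-size controller. This is loosely analogous to trust-region methods in optimization and helps guard against catastrophic policy updates.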
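One way to see the step-size effect is that standardized advantages do not change when the raw rewards are shifted or rescaled. The sketch below is a hedged illustration: `normalize` is a hypothetical helper and the reward values are made up for the demonstration, not taken from this repo's runs.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """Standardize rewards within a group (population std, assumed choice)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


binary = [1.0, 0.0, 0.0, 1.0]
# Same ranking of solutions, but a very different reward scale and offset.
rescaled = [100.0 * r + 5.0 for r in binary]

adv_binary = normalize(binary)
adv_rescaled = normalize(rescaled)

# Largest difference between the two advantage vectors.
max_gap = max(abs(a - b) for a, b in zip(adv_binary, adv_rescaled))
```

Here `max_gap` is negligible (it comes only from `eps`), so the magnitude of the policy-gradient signal depends on how the group's solutions compare with each other rather than on the raw reward scale.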
| Method | GSM8K Acc | VRAM | Training Time |
|---|---|---|---|
| SFT only | 35% | 2.8 GB | 15 min |
| + PPO (value network) | 42% | 7.2 GB | 4 hrs |
| + GRPO (this repo) | 45% | 4.1 GB | 2 hrs |
| + GRPO + verifier refinement | 48% | 4.3 GB | 2.5 hrs |
MIT

