GRPO with Verifiable Rewards

Group Relative Policy Optimization (GRPO) applied to math reasoning and structured output verification — exploring RL post-training without a value network.

Context: GRPO eliminates the value network (43% VRAM savings versus PPO in the results below), making RL post-training feasible under enterprise GPU constraints. Verifiable rewards (code execution, schema validation) provide the training signal.

flowchart LR
    P[Problem] --> G[Generate N Solutions]
    G --> V[Verifier]
    V --> |Correct| R+[Reward = 1]
    V --> |Wrong| R-[Reward = 0]
    R+ --> GRPO[GRPO Update]
    R- --> GRPO
    GRPO --> P
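
The loop in the flowchart can be sketched in Python as follows. This is a minimal illustration, not this repo's training code: `grpo_iteration`, `generate`, `verify`, and `update` are hypothetical names standing in for the policy sampler, the verifier, and the GRPO optimizer step.

```python
from typing import Callable, List


def grpo_iteration(
    problem: str,
    generate: Callable[[str], str],      # hypothetical: samples one solution from the policy
    verify: Callable[[str, str], bool],  # hypothetical: checks a solution against the problem
    update: Callable[[str, List[str], List[float]], None],  # hypothetical: one GRPO update
    n_solutions: int = 8,                # assumed group size N
) -> List[float]:
    """Run one pass of the flowchart: generate N solutions, verify, update."""
    solutions = [generate(problem) for _ in range(n_solutions)]
    # Binary reward as drawn in the flowchart: 1 if the verifier accepts, else 0.
    rewards = [1.0 if verify(problem, s) else 0.0 for s in solutions]
    update(problem, solutions, rewards)
    return rewards
```

Passing the three stages in as callables keeps the sketch independent of any particular model or verifier.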

🧮 Mathematical Foundation

GRPO Objective (Group Relative Policy Optimization)

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x}\left[\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \cdot \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)} - \beta\, \text{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})\right]$$
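
For a single prompt, the objective as written above can be evaluated from per-sequence log-probabilities. The sketch below is illustrative only: the function name `grpo_loss` and its arguments are assumptions, the KL term is passed in as a precomputed estimate rather than computed here, and no ratio clipping is applied because the formula above has none.

```python
import math
from typing import Sequence


def grpo_loss(
    logp_new: Sequence[float],    # log pi_theta(y_i | x) for each of the G samples
    logp_old: Sequence[float],    # log pi_old(y_i | x) for the same samples
    advantages: Sequence[float],  # group-relative advantages, one per sample
    kl_to_ref: float,             # assumed input: estimate of KL(pi_theta || pi_ref)
    beta: float,                  # KL penalty coefficient
) -> float:
    """Loss for one prompt: -(mean_i A_i * ratio_i - beta * KL)."""
    g = len(advantages)
    if g == 0 or len(logp_new) != g or len(logp_old) != g:
        raise ValueError("all inputs must have the same non-zero length G")
    # Importance ratio pi_theta / pi_old, computed from log-probabilities.
    surrogate = sum(
        a * math.exp(ln - lo)
        for a, ln, lo in zip(advantages, logp_new, logp_old)
    ) / g
    return -(surrogate - beta * kl_to_ref)
```

When the policy has not yet moved from `pi_old`, every ratio is 1 and the surrogate reduces to the mean advantage, which is zero for normalized advantages.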

Group-Relative Advantage

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G) + \epsilon}$$

No value network needed — advantages computed relative to the group.
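
A minimal sketch of this normalization using only the standard library. The function name is illustrative, and using the population standard deviation is an assumption: the formula does not say whether `std` is the population or sample estimate.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each reward by the mean and std of its own group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the G samples
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two correct and two wrong samples give advantages close to +1 and -1.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

If every sample in a group receives the same reward, all advantages are zero, so that group contributes no signal through the advantage term.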

Verifiable Reward Function

$$r(x, y) = \begin{cases} 1.0 & \text{if } \text{verify}(y, \text{answer}(x)) = \text{True} \\ 0.5 & \text{if } \text{format\_valid}(y) \wedge \neg\, \text{correct}(y) \\ 0.0 & \text{otherwise} \end{cases}$$
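
The three cases can be sketched for numeric answers as below. This is an assumption-laden example, not this repo's verifier: it assumes the model is asked to end its output with `#### <number>` (the convention used in GSM8K reference solutions), and `extract_answer` and `reward` are hypothetical helper names.

```python
import re
from typing import Optional

# Assumed output format: the final answer is the last thing in the output,
# written as "#### <number>".
_ANSWER_RE = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)\s*$")


def extract_answer(output: str) -> Optional[str]:
    """Return the final numeric answer as a string, or None if the format is invalid."""
    match = _ANSWER_RE.search(output.strip())
    if match is None:
        return None
    return match.group(1).replace(",", "")


def reward(output: str, answer: str) -> float:
    """1.0 if correct, 0.5 if well-formatted but wrong, 0.0 otherwise."""
    predicted = extract_answer(output)
    if predicted is None:
        return 0.0  # format invalid
    if float(predicted) == float(answer.replace(",", "")):
        return 1.0  # verified against the reference answer
    return 0.5      # format valid, answer wrong
```

The partial credit for a valid format gives the policy a signal even when no sample in a group is correct.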

PPO Clipping (for comparison)

$$\mathcal{L}_{\text{PPO}} = -\min\left(\frac{\pi_\theta}{\pi_{\text{old}}} \hat{A}, \text{clip}\left(\frac{\pi_\theta}{\pi_{\text{old}}}, 1-\epsilon, 1+\epsilon\right) \hat{A}\right)$$
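
For a single sample, the clipped objective above can be written out as below. The function name is illustrative and the default `clip_eps=0.2` is an assumed, commonly used value rather than a setting taken from this repo. Note that this ε is the clip range, not the small constant in the advantage denominator.

```python
def ppo_clipped_loss(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO loss: -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return -min(ratio * advantage, clipped_ratio * advantage)


# With a positive advantage, pushing the ratio past 1 + eps earns no extra credit.
print(ppo_clipped_loss(ratio=1.5, advantage=1.0))
# With a negative advantage, the ratio is floored at 1 - eps.
print(ppo_clipped_loss(ratio=0.5, advantage=-1.0))
```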

GRPO eliminates the value network overhead, which in the results below means about 43% less VRAM than PPO (4.1 GB vs 7.2 GB).

Dynamical Systems Perspective

RL training can be viewed as a discrete-time dynamical system on the policy parameters: $$\theta_{t+1} = \theta_t + \eta \nabla_\theta J(\pi_{\theta_t})$$

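GRPO's group-relative normalization acts as an adaptive step-size controller: dividing by the group standard deviation rescales advantages to roughly unit variance, so the size of each update does not depend on the reward scale. This is loosely analogous to trust-region methods in optimization and, together with the KL penalty, helps prevent catastrophic policy updates.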
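
The step-size interpretation can be checked numerically: shifting all rewards in a group, or scaling them by a positive factor, leaves the normalized advantages essentially unchanged (exactly unchanged if ε were zero). The helper below is an illustrative sketch that assumes the population standard deviation.

```python
from statistics import fmean, pstdev


def normalized_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 0.5, 0.0, 0.0]
# Same rewards on a very different scale and offset.
rescaled = [100.0 * r + 7.0 for r in rewards]

base = normalized_advantages(rewards)
scaled = normalized_advantages(rescaled)
max_gap = max(abs(a - b) for a, b in zip(base, scaled))
print(f"largest difference after rescaling: {max_gap:.2e}")
```

The remaining difference comes only from ε in the denominator.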

📊 Results (GSM8K, LFM2.5-1.2B)

| Method | GSM8K Acc | VRAM | Training Time |
|---|---|---|---|
| SFT only | 35% | 2.8 GB | 15 min |
| + PPO (value network) | 42% | 7.2 GB | 4 hrs |
| + GRPO (this repo) | 45% | 4.1 GB | 2 hrs |
| + GRPO + verifier refinement | 48% | 4.3 GB | 2.5 hrs |
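
The relative VRAM saving follows directly from the table. The snippet below recomputes it from the reported numbers; the dictionary is only the VRAM column transcribed, and the helper name is illustrative.

```python
# VRAM column of the results table, in GB.
vram_gb = {
    "SFT only": 2.8,
    "PPO": 7.2,
    "GRPO": 4.1,
    "GRPO + verifier refinement": 4.3,
}


def relative_savings(baseline: float, value: float) -> float:
    """Fraction of the baseline saved by switching to `value`."""
    return (baseline - value) / baseline


grpo_vs_ppo = relative_savings(vram_gb["PPO"], vram_gb["GRPO"])
print(f"GRPO uses {grpo_vs_ppo:.0%} less VRAM than PPO")  # GRPO uses 43% less VRAM than PPO
```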

License

MIT
