[add] add gating term on po algorithm by vanking20000918 · Pull Request #664 · jingyaogong/minimind

vanking20000918 · 2026-02-03T02:32:34Z

I modified the PPO and GRPO algorithms by adding a gating mechanism to the importance-ratio term, and found that this change can improve the stability of the training process.

gated PPO

The actor loss and the corresponding KL term are more stable during training with gated PPO.

Adding the gating term slows down updates when the policy changes by a large amount, which prevents the model from changing excessively. Note that we set the sigmoid parameter to a small value (0.1), so the gated ratio keeps its original monotonicity over a wide range.
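A minimal sketch of how a sigmoid gate on the importance ratio can behave. The gate form `2 * sigmoid(-k * |ratio - 1|)` and the names `gated_ratio` and `k` are assumptions made for illustration and are not taken from `train_gated_ppo.py`; only the value 0.1 comes from the description above.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_ratio(ratio: float, k: float = 0.1) -> float:
    # Hypothetical gate: equals 1 at ratio == 1 and shrinks as the
    # ratio moves away from 1, damping large policy changes.
    gate = 2.0 * sigmoid(-k * abs(ratio - 1.0))
    return ratio * gate


def is_increasing(values) -> bool:
    # True when every value is strictly larger than the previous one.
    return all(a < b for a, b in zip(values, values[1:]))


grid = [0.1 * i for i in range(1, 101)]  # ratios from 0.1 to 10.0
small_k = [gated_ratio(r, k=0.1) for r in grid]
large_k = [gated_ratio(r, k=1.0) for r in grid]

print(is_increasing(small_k), is_increasing(large_k))
```

Under this assumed form, `k = 0.1` keeps the gated ratio increasing across the whole grid, while `k = 1.0` makes it turn downward once the ratio moves past roughly 1.6, which is one way to read the remark about monotonicity.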

…Optimization) I mainly add two techniques to the original GRPO algorithm, following this [paper](https://arxiv.org/pdf/2503.14476): Clip-Higher and Dynamic Sampling.
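A rough sketch of the two ideas named above, as described in the linked DAPO paper. The function names are hypothetical, and the bounds 0.2 and 0.28 are the values reported in that paper rather than settings confirmed for this PR. The paper keeps sampling until the batch is full; this sketch only shows the filtering step.

```python
def clip_higher(ratio: float, eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    # Decoupled clip range: the upper bound is looser than the lower one,
    # leaving more room to raise the probability of low-probability tokens.
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


def has_learning_signal(rewards) -> bool:
    # A group whose rewards are all identical has zero group-relative
    # advantage, so it contributes no gradient.
    return max(rewards) != min(rewards)


def dynamic_sampling_filter(groups):
    # Keep only the prompt groups that still carry a learning signal.
    return [g for g in groups if has_learning_signal(g)]


groups = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
kept = dynamic_sampling_filter(groups)
print(kept)
```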

…orithm This is the main addition to the reference GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", we ensure that high-value exploration tokens continue to participate in parameter updates while training stays stable. In practice, this change did not help much in experiments, because the ratio stays very close to 1, so tokens are seldom out of bounds.
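To make the difference between "directly set to 0" and "bounded clipping" concrete, here is a small per-token sketch. It assumes one possible realization of bounded clipping, namely a surrogate of the form `stop_grad(clip(ratio)) * advantage * logp`; the function names and this exact form are assumptions and may not match what the commit implements.

```python
def clip(ratio: float, eps: float = 0.2) -> float:
    return min(max(ratio, 1.0 - eps), 1.0 + eps)


def ppo_clip_grad(ratio: float, adv: float, eps: float = 0.2) -> float:
    # Gradient of min(r * A, clip(r) * A) with respect to logp,
    # where r = exp(logp - logp_old), so dr/dlogp = r.
    if clip(ratio, eps) * adv < ratio * adv:
        return 0.0  # the clipped branch is active and is constant in logp
    return ratio * adv


def bounded_clip_grad(ratio: float, adv: float, eps: float = 0.2) -> float:
    # Assumed variant: the clipped ratio is treated as a constant weight
    # on A * logp, so the gradient is bounded but never zeroed.
    return clip(ratio, eps) * adv


for r in (1.0, 1.5):
    print(r, ppo_clip_grad(r, 1.0), bounded_clip_grad(r, 1.0))
```

With a ratio of 1.0 the two gradients coincide, which fits the observation that the change matters little when the ratio stays near 1; they only differ for out-of-bounds tokens such as a ratio of 1.5.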

Modify the PPO algorithm by adding a gating term on the importance ratio.

Add a gating term to the GRPO algorithm.

vanking20000918 · 2026-02-03T03:09:24Z

gated GRPO

Interestingly, the gating term did not improve stability for the GRPO algorithm.

vanking20000918 · 2026-02-03T06:33:20Z

This modification is somewhat similar to the soft gate in the SAPO algorithm (Soft Adaptive Policy Optimization).

Your Name and others added 8 commits January 30, 2026 11:03

[mod] fix spo algorithm in RLAIF part

020bd44

Update README.md

35fe139

[add] add DAPO algorithm (Decoupled Clip and Dynamic sAmpling Policy …

7389f64

[add] created train_gated_ppo.py

c540ea2

[add] Create train_gated_grpo.py

e84437a

Update train_gated_grpo.py

0b37f04

Update train_gated_ppo.py

db2d948

vanking20000918 marked this pull request as ready for review February 3, 2026 03:10

vanking20000918 changed the title ~~Qingguofan gated po algorithm~~ [add] add gating term on po algorithm Feb 3, 2026

vanking20000918 wants to merge 8 commits into jingyaogong:master from vanking20000918:qingguofan-gated_po_algorithm
