
[add] add gating term on po algorithm #664

Open
vanking20000918 wants to merge 8 commits into jingyaogong:master from vanking20000918:qingguofan-gated_po_algorithm

@vanking20000918 vanking20000918 commented Feb 3, 2026

I modified the PPO and GRPO algorithms by adding a gating mechanism to the importance ratio term, and found that this change can make training more stable.

  1. gated PPO
image

The actor loss and the KL term are both more stable during training with gated PPO.
image

The gating term slows down updates when the policy changes by a large amount, which prevents the model from changing excessively. Note that we set the sigmoid parameter to a small value, 0.1, so that the gated ratio stays monotonic in the original ratio over a wide range.
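
To make the idea concrete, here is a minimal sketch of one possible gate. The exact formula used in this PR is not shown in the description, so the form `ratio * 2 * sigmoid(-k * (ratio - 1))`, the function names, and the clip range are all assumptions for illustration only; the only value taken from the text is the sigmoid parameter `k = 0.1`.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_ratio(ratio: float, k: float = 0.1) -> float:
    """Hypothetical gate (an assumption, not the PR's exact formula).

    The gate 2 * sigmoid(-k * (ratio - 1)) equals 1 when ratio == 1 and
    shrinks as the ratio grows, so large policy changes are damped.
    """
    gate = 2.0 * sigmoid(-k * (ratio - 1.0))
    return ratio * gate


def gated_ppo_loss_term(ratio: float, advantage: float,
                        clip_eps: float = 0.2, k: float = 0.1) -> float:
    """Per-token PPO clipped loss with the gated ratio in place of the raw ratio."""
    g = gated_ratio(ratio, k)
    clipped = min(max(g, 1.0 - clip_eps), 1.0 + clip_eps)
    # PPO maximizes min(g * A, clip(g) * A); the loss is its negative.
    return -min(g * advantage, clipped * advantage)


if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0, 5.0):
        print(r, round(gated_ratio(r), 4))
```

In this sketch a ratio of 1 is left unchanged, ratios above 1 are pulled down, and ratios below 1 are pulled slightly up. With `k = 0.1` this particular gate keeps increasing until the ratio reaches roughly 13, while a large `k` such as 5 makes the gated ratio start decreasing before the ratio reaches 1. That is consistent with the remark about choosing a small sigmoid parameter.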

Your Name and others added 8 commits January 30, 2026 11:03
…Optimization)

I mainly added two components to the original GRPO algorithm, following this [paper](https://arxiv.org/pdf/2503.14476): Clip-Higher and Dynamic Sampling.
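
As a rough illustration of these two components, the sketch below shows a per-token surrogate with separate lower and upper clip bounds (Clip-Higher) and a filter that drops prompt groups whose rewards are all identical (Dynamic Sampling). The function names and the default values `eps_low = 0.2` and `eps_high = 0.28` are assumptions for illustration and may differ from what the commit actually uses.

```python
from typing import List


def clip_higher_term(ratio: float, advantage: float,
                     eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token surrogate with decoupled clip bounds (Clip-Higher).

    A larger upper bound leaves more room to raise the probability of
    low-probability tokens that have a positive advantage.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def keep_group(rewards: List[float]) -> bool:
    """Dynamic Sampling filter for one prompt's group of sampled responses.

    If every response has the same reward, the group-normalized advantage
    is zero for all of them, so the group carries no gradient signal.
    """
    return max(rewards) != min(rewards)


def filter_groups(groups: List[List[float]]) -> List[List[float]]:
    """Keep only groups with mixed rewards; callers would resample to refill the batch."""
    return [g for g in groups if keep_group(g)]


if __name__ == "__main__":
    print(clip_higher_term(1.5, 1.0))
    print(filter_groups([[1, 1, 1], [0, 1, 0], [0, 0]]))
```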
…orithm

This is the main addition to the reference GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", we ensure that high-value exploration tokens can continue to participate in parameter updates while maintaining training stability.
In practice, this change did not help much in experiments, because the ratio stays very close to 1, so tokens are seldom out of bounds.
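
The difference between the two behaviors can be sketched by comparing the gradient weight each token receives with respect to its log-probability. This is only one reading of "bounded clipping", not necessarily what the commit implements; the function names and `eps = 0.2` are hypothetical.

```python
def hard_clip_grad_weight(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Gradient weight on log-prob under the standard clipped objective.

    When the clipped branch is active (ratio too high with A > 0, or too low
    with A < 0), the objective is constant in the policy, so the weight is 0.
    """
    lo, hi = 1.0 - eps, 1.0 + eps
    out_of_bounds = (advantage > 0 and ratio > hi) or (advantage < 0 and ratio < lo)
    return 0.0 if out_of_bounds else ratio * advantage


def bounded_clip_grad_weight(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Assumed 'bounded clipping' variant: out-of-bounds tokens keep a gradient,
    but its weight is capped by the clipped ratio instead of being zeroed."""
    lo, hi = 1.0 - eps, 1.0 + eps
    out_of_bounds = (advantage > 0 and ratio > hi) or (advantage < 0 and ratio < lo)
    if out_of_bounds:
        return min(max(ratio, lo), hi) * advantage
    return ratio * advantage


if __name__ == "__main__":
    for r, a in ((1.5, 1.0), (1.05, 1.0), (0.5, -1.0)):
        print(r, a, hard_clip_grad_weight(r, a), bounded_clip_grad_weight(r, a))
```

For a ratio inside the clip range the two weights are identical, which matches the observation that the change has little effect when the ratio stays close to 1.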
modify ppo algorithm by adding gating term on importance ratio
add gating term on grpo algorithm
@vanking20000918

Author
  1. gated GRPO
image

Interestingly, the gating term does not improve stability in the GRPO algorithm.
image

@vanking20000918 vanking20000918 marked this pull request as ready for review February 3, 2026 03:10
@vanking20000918 vanking20000918 changed the title Qingguofan gated po algorithm [add] add gating term on po algorithm Feb 3, 2026
@vanking20000918

Author

This modification is somewhat similar to the soft gate in the SAPO algorithm (Soft Adaptive Policy Optimization).
