[add] add gating term on po algorithm #664
Open
vanking20000918 wants to merge 8 commits into
…Optimization) I mainly add two techniques to the original GRPO algorithm, following this [paper](https://arxiv.org/pdf/2503.14476): Clip-Higher and Dynamic Sampling
…orithm This is the main addition relative to the reference GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", we ensure that high-value exploration tokens can continue to participate in parameter updates while maintaining training stability. In practice, this change did not help much in experiments, because the ratio stays very close to 1, so tokens are seldom out of bounds.
modify ppo algorithm by adding gating term on importance ratio
add gating term on grpo algorithm
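For readers unfamiliar with the two techniques named in the commits above, here is a minimal per-token sketch. It is not the code from this PR: the function names and the `eps_low` / `eps_high` values are illustrative assumptions, and the dynamic-sampling filter assumes that a group whose rewards are all identical carries no learning signal.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped objective for one token, with decoupled bounds.

    Clip-Higher (as assumed here) means eps_high > eps_low, so the ratio
    may grow further above 1 than it may shrink below 1.
    The default values are illustrative, not taken from this PR.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic bound: take the smaller of the raw and clipped objectives.
    return min(ratio * advantage, clipped * advantage)


def keep_group(rewards: list[float]) -> bool:
    """Dynamic Sampling filter (assumed form): drop groups whose rewards
    are all identical, since their group-normalized advantages are zero."""
    return len(set(rewards)) > 1


# A ratio of 1.25 with a positive advantage is left unclipped when
# eps_high = 0.28, but is clipped to 1.2 under a symmetric eps = 0.2.
asymmetric = clipped_surrogate(1.25, 1.0)
symmetric = clipped_surrogate(1.25, 1.0, eps_high=0.2)
```

A real implementation would apply the same logic to batched tensors and average over tokens; the scalar version only shows where the two bounds enter.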
Author
This modification is somewhat similar to the soft gate in the SAPO algorithm (Soft Adaptive Policy Optimization).


I modified the PPO and GRPO algorithms by adding a gating mechanism to the importance ratio term, and found that this change improves the stability of the training process.
The actor loss and KL terms are more stable during training with gated PPO.

The gating term slows down updates when the policy changes by a large amount, preventing the model from changing excessively. Note that we set the parameter in the sigmoid to a small value of 0.1, which lets the gated ratio keep its original monotonicity over a wide range.
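The effect of the sigmoid parameter on monotonicity can be checked numerically. The sketch below is an assumption, not the formula used in this PR (which is not shown in this description): it uses a hypothetical gate `2 * sigmoid(-k * |ratio - 1|)`, which equals 1 when the ratio is 1 and shrinks as the ratio moves away from 1. The names `gated_ratio` and `is_increasing` are also illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_ratio(ratio: float, k: float = 0.1) -> float:
    """Hypothetical soft gate on the importance ratio (assumed form).

    The gate is 1 at ratio == 1 and decays smoothly as the ratio
    drifts away from 1, damping large policy changes.
    """
    gate = 2.0 * sigmoid(-k * abs(ratio - 1.0))
    return ratio * gate


def is_increasing(k: float, lo: float = 1.0, hi: float = 10.0,
                  steps: int = 900) -> bool:
    """Check on a grid whether the gated ratio is strictly increasing on [lo, hi]."""
    values = [gated_ratio(lo + (hi - lo) * i / steps, k) for i in range(steps + 1)]
    return all(b > a for a, b in zip(values, values[1:]))


small_k_monotone = is_increasing(0.1)  # gate parameter from the description
large_k_monotone = is_increasing(1.0)  # a larger value, for comparison
```

Under this assumed gate, `k = 0.1` keeps the gated ratio increasing across the whole interval from 1 to 10, while `k = 1.0` makes it turn downward before the ratio reaches 3, so a larger ratio could produce a smaller gated value. This is consistent with the reason given above for choosing a small parameter.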