A minimal implementation of PPO-based reinforcement learning for vision-language models, focused on mathematical reasoning tasks.
🚧 Work in progress...
This implementation is inspired by the MAYE framework described in "Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme".
@article{ma2025maye,
title={Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme},
author={Ma, Yan and Chern, Steffi and Shen, Xuyang and Zhong, Yiran and Liu, Pengfei},
journal={arXiv preprint arXiv:2504.02587},
year={2025},
}