@yzhbradoodrrpurp @shx2005
Hi, great work!
I'd like to point you to a closely related prior paper that is worth citing: Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation (arXiv:2605.18740, code: https://github.com/VisionOPD/Vision-OPD). It shares V-Zero's core setup: it uses on-policy distillation from a crop-conditioned teacher to a full-image student, is answer-label-free, needs no RL/reward verifier, and adds no inference-time tools.
Since crop-conditioned OPD is central to V-Zero, it'd be great to cite and discuss Vision-OPD in the paper. V-Zero's contribution is a clear addition on top of this line of work, so positioning it relative to Vision-OPD would help readers understand what's new.