Fine-tune Qwen 2.5-VL (or any vision-language model with the same API) on image grounding tasks using GRPO (Group Relative Policy Optimization) in just a few lines of code.
- Plug-and-play trainer: drop in your own JSON dataset of prompts + bounding boxes and start training.
- Image-aware data collator: automatically loads, preprocesses and batches images.
- Reward-based optimisation: leverages the `trl` library's GRPO algorithm for RL-style fine-tuning.
- Minimal codebase: only three Python files, easy to read and customise.
- Accepts an `image_processor` and an `images_root` folder.
- Overrides `data_collator` to:
  - Load images with Pillow.
  - Batch-encode them via the Hugging Face `AutoProcessor`.
  - Return a dict containing:
    - `pixel_values`: tensor (C × H × W)
    - `prompt`: instruction string
    - `solution`: ground-truth bbox or coordinates
    - `scales`: original image size
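The collation steps above can be sketched roughly as follows. This is not the repository's code: the record keys (`image`, `prompt`, `solution`) and the helper name `make_collator` are assumptions made for illustration, and the image loader and encoder are passed in as callables so the sketch runs without Pillow or `transformers` installed.

```python
from pathlib import Path


def make_collator(images_root, load_image, encode_images):
    """Build a collate function (hypothetical helper, not the repo's API).

    load_image:    callable(path) -> image object exposing a `.size` attribute
    encode_images: callable(list of images) -> batched pixel values
    """
    root = Path(images_root)

    def collate(examples):
        # Resolve each record's relative image path against images_root and load it.
        images = [load_image(root / ex["image"]) for ex in examples]
        return {
            "pixel_values": encode_images(images),            # batched image tensor
            "prompt": [ex["prompt"] for ex in examples],      # instruction strings
            "solution": [ex["solution"] for ex in examples],  # ground-truth bbox or coordinates
            "scales": [img.size for img in images],           # original (width, height)
        }

    return collate
```

With Pillow and a Hugging Face image processor, the two callables might be `lambda p: Image.open(p).convert("RGB")` and `lambda imgs: image_processor(images=imgs, return_tensors="pt")["pixel_values"]`; treat both as assumptions to check against the processor you actually load.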
Tiny subclass that forwards all arguments to the real Qwen 2.5-VL model while gracefully ignoring the extra `logits_to_keep` parameter passed by the GRPO trainer.
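A minimal sketch of that pattern is shown below. `BaseModel` is a stand-in so the example runs on its own; in the real code the parent would presumably be the Qwen 2.5-VL model class from `transformers`, and the subclass name here is hypothetical.

```python
class BaseModel:
    """Stand-in for the real model class (assumption, for illustration only)."""

    def forward(self, input_ids, attention_mask=None, **kwargs):
        # Echo the arguments so the forwarding behaviour is easy to inspect.
        return {"input_ids": input_ids, "attention_mask": attention_mask, "extra": kwargs}


class IgnoreLogitsToKeep(BaseModel):
    """Forwards everything to the parent except `logits_to_keep`."""

    def forward(self, *args, logits_to_keep=None, **kwargs):
        # Capture logits_to_keep as a named parameter so it never reaches the parent.
        return super().forward(*args, **kwargs)


out = IgnoreLogitsToKeep().forward([1, 2, 3], attention_mask=[1, 1, 1], logits_to_keep=5)
print(out["extra"])  # logits_to_keep was dropped before the parent call
```

Capturing the keyword in the signature, rather than popping it from `kwargs`, keeps the override to a single line and leaves every other argument untouched.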
Currently only `accuracy_reward_coord`, which returns 1 if the (x, y) coordinate predicted by the model falls inside the ground-truth bounding box and 0 otherwise.
Feel free to add IoU- or distance-based rewards here.
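The sketch below shows what a point-in-box reward and an IoU helper could look like. It is not the repository's `accuracy_reward_coord`: the function names, the assumption that each completion is a plain string containing `x, y`, and the `[x1, y1, x2, y2]` box layout are all illustrative choices.

```python
import re

_NUM = r"-?\d+(?:\.\d+)?"
_POINT = re.compile(rf"({_NUM})\s*,\s*({_NUM})")


def parse_point(text):
    """Return the first 'x, y' pair found in text, or None if there is none."""
    m = _POINT.search(text)
    return (float(m.group(1)), float(m.group(2))) if m else None


def point_in_box_reward(completions, solution, **kwargs):
    """1.0 if the predicted point lies inside the ground-truth box, else 0.0.

    Assumes completions are strings and each solution is [x1, y1, x2, y2].
    """
    rewards = []
    for text, (x1, y1, x2, y2) in zip(completions, solution):
        point = parse_point(text)
        hit = point is not None and x1 <= point[0] <= x2 and y1 <= point[1] <= y2
        rewards.append(1.0 if hit else 0.0)
    return rewards


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

An unparseable completion scores 0.0 rather than raising, so a single malformed generation does not interrupt a training step. An IoU-based reward would need the model to emit a full box instead of a point, and its values would fall in [0, 1] rather than {0, 1}.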
Provides a concrete example wiring everything together.
Customise the constants at the top, or replace them with argparse flags for production use.
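One way to swap the constants for command-line flags is sketched below. The flag names and defaults are hypothetical; map them to whatever constants the example script actually defines.

```python
import argparse


def build_parser():
    """Build a CLI parser; every flag name here is an illustrative assumption."""
    parser = argparse.ArgumentParser(description="GRPO grounding fine-tuning (sketch)")
    parser.add_argument("--dataset-json", required=True, help="JSON file of prompts and boxes")
    parser.add_argument("--images-root", required=True, help="folder containing the images")
    parser.add_argument("--batch-size", type=int, default=2, help="per-device train batch size")
    parser.add_argument("--num-generations", type=int, default=4, help="samples per prompt")
    parser.add_argument("--bf16", action="store_true", help="train in bfloat16")
    return parser


# Parse an explicit argument list here; a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["--dataset-json", "data.json", "--images-root", "images/"])
print(args.batch_size, args.num_generations, args.bf16)
```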
| Hyper-parameter | Where to set | Notes |
|---|---|---|
| `per_device_train_batch_size` | `GRPOConfig` | Limited by GPU memory; images are heavy! |
| `num_generations` | `GRPOConfig` | How many action samples to draw per prompt. |
| `reward_funcs` | trainer init | List of callables returning a reward ∈ {0, 1}. |
| `bf16` / `fp16` | `GRPOConfig` | Use bf16 on A100/H100 for speed and memory efficiency. |