feat: synthetic image and video data generation for VLM benchmarking #732
Open: zakariaelh wants to merge 14 commits into vllm-project:main from zakariaelh:feat/synthetic-multimodal
214f67b extras/vision: add synthesize_image and synthesize_video helpers (zakariaelh)
8990552 data: add synthetic_image and synthetic_video deserializers (zakariaelh)
20aa0db tests: unit + integration coverage for synthetic_image and synthetic_… (zakariaelh)
ef4dc7a docs: README usage examples for synthetic_image and synthetic_video (zakariaelh)
1d8a8ce fix: declare image/video features and make encoders idempotent (zakariaelh)
0b324a9 tests: add WRITTEN BY AI marker per AGENTS.md (zakariaelh)
f6ab0e6 Fix pre-existing lint and type-check failures (Jckwind)
3166021 Add coordinate warp to synthetic gradient generator (Jckwind)
f417cac docs: move synthetic visual data out of README into dedicated guide (zakariaelh)
ae7b2c9 review: address second-pass docs and code comments (zakariaelh)
061d319 data: align synthetic vision with kind routing (alityb)
5906843 tests: update synthetic vision kind coverage (alityb)
b015823 docs: update synthetic vision CLI examples (alityb)
ef6579e tox: install dev extra for quality envs (alityb)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| --- | ||
| weight: 40 | ||
| --- | ||
|
|
||
| # Synthetic Visual Data | ||
|
|
||
| GuideLLM can synthesize images and short videos on the fly so you can benchmark Vision-Language Model (VLM) serving configurations without bringing your own dataset. Two `--data` kinds — `synthetic_image` and `synthetic_video` — compose with the existing synthetic text token controls (`text_tokens`, `output_tokens`, and their `stdev`/`min`/`max` companions) so a single command produces a fully-shaped multimodal request. | ||
|
|
||
| Synthetic visual data is useful when you want to control payload shape precisely (image dimensions, frame count, frames-per-second) or stress-test serving paths that the preprocessor cache would otherwise hide. Defaults are tuned so every generated payload is byte-different from the next, which defeats vLLM's multimodal preprocessor cache while still compressing like real media on the wire. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| Install GuideLLM with the `vision` extra to enable image and video synthesis: | ||
|
|
||
| ```bash | ||
| pip install "guidellm[vision]" | ||
| ``` | ||
|
|
||
| ## Synthetic image | ||
|
|
||
| Use `--data "kind=synthetic_image"` to generate a single image per request alongside any text prompt. | ||
|
|
||
| ### Example Commands | ||
|
|
||
| A single 720p image alongside 200 text tokens and 64 output tokens: | ||
|
|
||
| ```bash | ||
| guidellm run \ | ||
| --backend "kind=openai_http,target=http://localhost:8000" \ | ||
| --data "kind=synthetic_image,resolution=720p,text_tokens=200,output_tokens=64" | ||
| ``` | ||
|
|
||
| Two 1280×720 JPEG images per request: | ||
|
|
||
| ```bash | ||
| guidellm run \ | ||
| --backend "kind=openai_http,target=http://localhost:8000" \ | ||
| --data "kind=synthetic_image,width=1280,height=720,format=jpeg,images_per_request=2,text_tokens=200,output_tokens=64" | ||
| ``` | ||
|
|
||
| ### Configuration Options | ||
|
|
||
| - `width`: Width of the generated image in pixels. | ||
| - `height`: Height of the generated image in pixels. | ||
| - `resolution`: Shorthand that sets `height` to a named value (`480p`, `720p`, `1080p`, …); pairs with `aspect_ratio` to derive `width`. | ||
| - `aspect_ratio`: Shorthand such as `16:9` or `4:3` that derives the missing dimension when only one of `width`/`height`/`resolution` is given. | ||
| - `format`: Encoded image format, `jpeg` (default) or `png`. | ||
| - `jpeg_quality`: JPEG quality factor (1–100) when `format=jpeg`. Defaults to 85. | ||
| - `content`: Per-row image content. `gradient` (default) emits a per-row seeded gradient that compresses like real photography; `noise` emits uniform random pixels for worst-case wire size; `solid` and `checkerboard` are useful for preprocessor-cache sensitivity sweeps. | ||
| - `images_per_request`: Number of images to attach to each request. Defaults to 1. | ||
| - `text_tokens`: Average number of tokens in the accompanying text prompt. Accepts the same `stdev` / `min` / `max` suffixes as the synthetic text mode. `prompt_tokens` is accepted as an alias. | ||
| - `output_tokens`: Average number of tokens the model should generate. Same `stdev` / `min` / `max` suffixes apply. | ||
| - `seed`: Random seed for reproducible generation across runs. | ||
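The way `resolution`, `aspect_ratio`, `width`, and `height` combine can be sketched in a few lines of Python. This is an illustration only, not GuideLLM's actual resolver: the function name `resolve_dimensions`, the 16:9 default, the rule that an explicit `height` wins over `resolution`, and the rounding of derived dimensions to even pixel counts are all assumptions made for the example.

```python
from fractions import Fraction


def _round_even(value):
    """Round to the nearest even integer (assumed here because yuv420p video needs even sizes)."""
    return 2 * round(value / 2)


def resolve_dimensions(width=None, height=None, resolution=None, aspect_ratio="16:9"):
    """Hypothetical helper: derive (width, height) from the documented shorthands."""
    # `resolution` such as "720p" names a height; an explicit height wins (assumption).
    if height is None and resolution is not None:
        height = int(resolution.rstrip("p"))

    if width is None and height is None:
        raise ValueError("need at least one of width, height, or resolution")

    # Parse "16:9" into an exact ratio of width to height.
    numerator, denominator = (int(part) for part in aspect_ratio.split(":"))
    ratio = Fraction(numerator, denominator)

    # Derive whichever dimension is still missing.
    if width is None:
        width = _round_even(height * ratio)
    elif height is None:
        height = _round_even(width / ratio)
    return width, height


print(resolve_dimensions(resolution="720p"))
print(resolve_dimensions(resolution="480p", aspect_ratio="4:3"))
```

Under these assumptions, `resolution=480p` at 16:9 resolves to 854×480, which matches the explicit `width=854,height=480` used in the video example.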
|
|
||
| ## Synthetic video | ||
|
|
||
| Use `--data "kind=synthetic_video"` to generate a short clip per request alongside any text prompt. Output is `mp4` (h264, yuv420p). | ||
|
|
||
| ### Example Commands | ||
|
|
||
| A six-frame 480p clip at 1 fps with modest prompt and output budgets: | ||
|
|
||
| ```bash | ||
| guidellm run \ | ||
| --backend "kind=openai_http,target=http://localhost:8000" \ | ||
| --data "kind=synthetic_video,width=854,height=480,frames=6,fps=1,text_tokens=64,output_tokens=128" | ||
| ``` | ||
|
|
||
| A twelve-frame 720p clip at 3 fps with an explicit h264 target bitrate: | ||
|
|
||
| ```bash | ||
| guidellm run \ | ||
| --backend "kind=openai_http,target=http://localhost:8000" \ | ||
| --data "kind=synthetic_video,width=1280,height=720,frames=12,fps=3,video_bitrate=2M,text_tokens=64,output_tokens=128" | ||
| ``` | ||
|
|
||
| ### Configuration Options | ||
|
|
||
| - `width`: Width of the generated video in pixels. | ||
| - `height`: Height of the generated video in pixels. The same `resolution` / `aspect_ratio` shorthands as for synthetic image apply. | ||
| - `frames`: Number of frames in the clip. | ||
| - `fps`: Frames per second. Combined with `frames`, this also determines the clip duration. | ||
| - `video_bitrate`: Optional h264 target bitrate (e.g. `1M`, `500k`), useful when you want to keep wire size roughly constant across runs. | ||
| - `content`: Per-row clip content. `gradient` (default) emits a seeded gradient with a coordinate warp so each clip compresses similarly to real video; `noise` emits uniform random pixels for worst-case wire size. | ||
| - `text_tokens`: Average number of tokens in the accompanying text prompt; same `stdev` / `min` / `max` suffixes as synthetic image. `prompt_tokens` is accepted as an alias. | ||
| - `output_tokens`: Average number of tokens the model should generate; same `stdev` / `min` / `max` suffixes apply. | ||
| - `seed`: Random seed for reproducible generation across runs. | ||
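The relationship between `frames`, `fps`, and `video_bitrate` is simple arithmetic, sketched below in Python. The helper names are hypothetical and not part of GuideLLM. The suffix handling assumes the usual ffmpeg convention (`k` = 10^3, `M` = 10^6 bits per second), and the size estimate ignores container overhead and encoder variance, so treat it as a rough planning figure rather than a guarantee.

```python
def clip_duration_seconds(frames: int, fps: float) -> float:
    """Clip duration implied by a frame count and a frame rate."""
    return frames / fps


def parse_bitrate(value: str) -> int:
    """Convert a bitrate string such as '500k' or '2M' to bits per second.

    Assumes SI suffixes (k = 1_000, M = 1_000_000), as ffmpeg does.
    """
    suffixes = {"k": 1_000, "M": 1_000_000}
    if value[-1] in suffixes:
        return int(float(value[:-1]) * suffixes[value[-1]])
    return int(value)


def estimate_video_bytes(frames: int, fps: float, video_bitrate: str) -> int:
    """Rough payload size: target bitrate times duration, in bytes."""
    bits = parse_bitrate(video_bitrate) * clip_duration_seconds(frames, fps)
    return int(bits / 8)


# The twelve-frame, 3 fps, 2M example lasts 4 seconds.
print(clip_duration_seconds(12, 3))
print(estimate_video_bytes(12, 3, "2M"))
```

Very short clips often come out larger or smaller than this estimate, because a target bitrate is only a target for the encoder.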
|
|
||
| ## Notes | ||
|
|
||
| - A tokenizer is required for the text portion of the request. By default the model passed in or retrieved from the server is used; otherwise specify one with `--tokenizer`. | ||
| - Per-row seeded gradients produce byte-different payloads on every request, which bypasses vLLM's multimodal preprocessor cache. If you want to deliberately hit the cache, use fixed payload settings such as `content=solid` for images, or a fixed `seed` with a fixed `--data-loader "kind=pytorch,samples=..."` value. | ||
| - The exact mp4 bytes produced for a given seed depend on the installed `ffmpeg` and `PIL` versions. Output token counts and request shape stay stable across versions, but if you are comparing byte-level outputs or wire-size measurements across machines, expect small variation. |
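The cache behaviour described in the notes follows from how per-row seeding works in general. The Python sketch below is not GuideLLM's generator: `gradient_row_bytes`, the way the seed and row index are mixed, and the raw RGB layout are all assumptions chosen to keep the example dependency-free. It only illustrates the idea that a fixed seed reproduces the same bytes for a given row, while different rows get different bytes.

```python
import hashlib
import random


def gradient_row_bytes(seed: int, row_index: int, width: int = 64, height: int = 4) -> bytes:
    """Hypothetical generator: raw RGB bytes of a horizontal gradient for one dataset row."""
    # Mix the run seed with the row index so each row gets its own RNG stream.
    rng = random.Random(seed * 1_000_003 + row_index)
    start = [rng.randrange(256) for _ in range(3)]
    end = [rng.randrange(256) for _ in range(3)]

    pixels = bytearray()
    for _y in range(height):
        for x in range(width):
            t = x / (width - 1)  # 0.0 at the left edge, 1.0 at the right edge
            pixels.extend(round(s + (e - s) * t) for s, e in zip(start, end))
    return bytes(pixels)


def payload_digest(payload: bytes) -> str:
    """Content hash, standing in for whatever key a preprocessor cache might use."""
    return hashlib.sha256(payload).hexdigest()


# Same seed and row: identical digest. Different rows: different digests.
print(payload_digest(gradient_row_bytes(42, 0)) == payload_digest(gradient_row_bytes(42, 0)))
print(payload_digest(gradient_row_bytes(42, 0)) == payload_digest(gradient_row_bytes(42, 1)))
```

A content-keyed cache would see every row as new in this scheme, and would only hit if the same row is replayed with the same seed or if the content does not vary per row.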
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -93,6 +93,7 @@ audio = [ | |
| vision = [ | ||
| "datasets[vision]", | ||
| "pillow", | ||
| "imageio[ffmpeg]", | ||
| ] | ||
| # Dev Tooling | ||
| dev = [ | ||
|
|
||