[Feature] VLM DFlash Training: Multi-Model Support for Qwen3-VL / Qwen3.5 / Qwen3.6 #585
Open
zyk42 wants to merge 1 commit into
Good job!

@FrankLeeeee This PR adds DFlash training support for Qwen3-VL, Qwen3.5, and Qwen3.6 models, including HF loading, partial-RoPE support, automatic embedding detection, and transformers 5.7.0 compatibility. Validation on Qwen3-VL-30B-A3B achieved an accept length of 3.52 and a +35.8% inference speedup.
…ckend) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
f3a65b4 to 9323a51
`from specforge.core.dflash import OnlineDFlashModel, QwenVLOnlineDFlashModel`: it seems that `QwenVLOnlineDFlashModel` has not been uploaded yet?
Model support
Model types: `qwen3_vl`, `qwen3_vl_moe`, `qwen3_5`, `qwen3_5_moe`

Description
This PR extends SpecForge DFlash training to support multiple VLM model families with the HF backend. Key changes:

- Extended VLM model types: `QWEN3_VL_MODEL_TYPES` now includes `qwen3_5_moe` and `qwen3_5` in addition to `qwen3_vl` and `qwen3_vl_moe`.
- Auto VLM embedding detection: automatically sets `--embedding-key=model.language_model.embed_tokens.weight` for all VLM model types (previously this had to be specified manually for `qwen3_vl`).
- Qwen3.5/3.6 HF loading: added the `Qwen3_5MoeForConditionalGeneration` import path with a graceful fallback for transformers < 5.7.0.
- Partial rotation RoPE: Qwen3.5/3.6 uses `partial_rotary_factor=0.25` (only 64 out of 256 dims get RoPE). Updated `apply_rotary_pos_emb` in the DFlash draft model to handle `rotary_dim < head_dim`.
- Draft configs: new config files for all supported models with correct `target_layer_ids` (starting from layer 3+ for deepstack VLMs), `mrope_section`, and `partial_rotary_factor`.
- transformers 5.7.0 compatibility: added `mm_token_type_ids` generation for Qwen3-VL models (required by transformers >= 5.7.0). Token type IDs are generated from `input_ids` (image tokens → 1, video tokens → 2) and passed to both `forward()` and `get_rope_index()`.

Training results (Qwen3-VL-30B-A3B, GUI Agent task)
Configuration
Best result (278K, 5-layer, block_size=8)
Data scaling results
Break-even analysis (4x RTX 5090, TP=4)
Overfitting behavior (100K data)
Key observations
- `target_layer_ids`: Qwen3-VL must skip the first 3 layers (deepstack); Qwen3.5/3.6 starts from layer 1

Usage
Draft config design
Key design principles:
- `model_type` is always `qwen3_vl_text` (the draft is always a dense decoder, regardless of target architecture)
- `target_layer_ids`: 5 layers, evenly distributed. Qwen3-VL starts from layer 3 (deepstack); Qwen3.5/3.6 starts from layer 0
- `hidden_size`, `num_attention_heads`, `num_key_value_heads`, `head_dim`, `vocab_size`, `rope_theta` are kept consistent with the target
- Qwen3.5/3.6: `partial_rotary_factor=0.25` (only 64/256 dims use RoPE), `rope_theta=10000000`
- Qwen3-VL: `rope_theta=5000000`

Files changed
| File | Changes |
| --- | --- |
| `scripts/train_dflash.py` | Extend `QWEN3_VL_MODEL_TYPES` to include `qwen3_5_moe`, `qwen3_5`; auto-detect VLM embedding key; add Qwen3.5 HF loading |
| `specforge/modeling/draft/dflash.py` | `apply_rotary_pos_emb` supports partial rotation (`rotary_dim < head_dim`) |
| `specforge/modeling/target/dflash_target_model.py` | `qwen3_5_moe`/`qwen3_5` loading; add `mm_token_type_ids` generation for transformers 5.7.0 compatibility |
| `configs/qwen3.5-35b-a3b-dflash-vlm-8layer.json` | New draft config |
| `configs/qwen3.5-9b-dflash-vlm-8layer.json` | New draft config |
| `configs/qwen3-vl-30b-a3b-dflash-vlm-8layer.json` | New draft config |
| `configs/qwen3-vl-8b-dflash-vlm-8layer.json` | New draft config |
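
For readers unfamiliar with partial rotation, the case that `apply_rotary_pos_emb` has to handle (`rotary_dim < head_dim`) can be illustrated with a minimal, framework-free sketch. This is not the PR's implementation: the helper names, the rotate-half pairing, and the list-based single-vector interface are assumptions chosen to keep the example dependency-free, whereas the actual code operates on PyTorch tensors.

```python
import math


def rotate_half(x):
    """Return [-x2, x1], where x1 and x2 are the first and second halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_partial_rope(x, position, partial_rotary_factor=0.25, rope_theta=10_000_000.0):
    """Rotate only the first rotary_dim entries of one head vector (hypothetical helper)."""
    head_dim = len(x)
    rotary_dim = int(head_dim * partial_rotary_factor)  # e.g. 256 * 0.25 = 64
    x_rot, x_pass = x[:rotary_dim], x[rotary_dim:]

    # One angle per rotated pair, repeated so it lines up with both halves of x_rot.
    half_angles = [
        position * rope_theta ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)
    ]
    angles = half_angles + half_angles

    rotated = rotate_half(x_rot)
    x_rot = [a * math.cos(t) + b * math.sin(t) for a, b, t in zip(x_rot, rotated, angles)]

    # The remaining head_dim - rotary_dim entries pass through unchanged.
    return x_rot + x_pass


query = [float(i) for i in range(256)]
out = apply_partial_rope(query, position=3)
```

With `head_dim=256` and `partial_rotary_factor=0.25`, only the first 64 entries of `out` differ from `query`; the other 192 are copied through, which is the behaviour a full-rotation implementation would get wrong.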
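
The `mm_token_type_ids` generation described above (image tokens → 1, video tokens → 2) could look roughly like the following sketch. The function name, the nested-list input, the placeholder token IDs, and the choice of 0 for all other tokens are assumptions for illustration; the PR itself derives the IDs from the target model and works on tensors.

```python
def build_mm_token_type_ids(input_ids, image_token_id, video_token_id):
    """Label each token: 1 for image placeholders, 2 for video placeholders, 0 otherwise."""

    def token_type(token_id):
        if token_id == image_token_id:
            return 1
        if token_id == video_token_id:
            return 2
        return 0  # assumption: every other token is treated as text

    return [[token_type(t) for t in row] for row in input_ids]


# Hypothetical placeholder IDs; real values come from the model config.
IMAGE_ID, VIDEO_ID = 9001, 9002
batch = [[11, 9001, 9001, 12], [9002, 9002, 13, 14]]
mm_token_type_ids = build_mm_token_type_ids(batch, IMAGE_ID, VIDEO_ID)
print(mm_token_type_ids)  # [[0, 1, 1, 0], [2, 2, 0, 0]]
```

The result has the same shape as `input_ids`, so it can be passed alongside them wherever the target model expects per-token type information.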