Checklist
Motivation
Currently, the DFlash training pipeline in scripts/train_dflash.py is designed for text-only data. However, many real-world speculative decoding scenarios involve multimodal content (e.g., image + text). Eagle3 has demonstrated support for multimodal inputs, and it would be valuable for SpecForge's DFlash implementation to support image-text inputs as well.
Specifically, we would like to train DFlash draft models for vision-language models such as Qwen3.5/Qwen3.6, where the target model processes both images and text, and the draft model accelerates token generation by predicting future text tokens conditioned on the multimodal context.
Related resources
- Eagle3 (supports multimodal speculative decoding): https://github.com/NVlabs/Eagle
- Qwen3.5 model family (potential target models)
- Existing flexible embedding key in SpecForge: --embedding-key (e.g., model.language_model.embed_tokens.weight)
Proposed scope
- Data pipeline: Support loading image-text paired data in build_eagle3_dataset or a dedicated DFlash multimodal dataset builder.
- Model inputs: Allow the target model to accept pixel values / image embeddings alongside input_ids and attention_mask.
- Hidden-state capture: Correctly capture hidden states from the target multimodal language model at the specified target_layer_ids, taking into account vision-language fusion.
- Draft model alignment: Ensure the DFlash draft model can consume the multimodal context and produce block-level draft predictions.
- Example script & config: Provide an example training script and a reference config for a multimodal DFlash draft model.
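To make the data-pipeline and model-input items above more concrete, here is a minimal sketch of how one image-text sample could be laid out so that the draft model is trained only on text tokens. Everything in it is an assumption for illustration, not existing SpecForge code: the names MultimodalSample, build_sample, and IMAGE_TOKEN_ID are hypothetical, and the single image placeholder is expanded to a fixed number of positions in the way some vision-language processors do.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder id; a real tokenizer defines its own image token id.
IMAGE_TOKEN_ID = -200


@dataclass
class MultimodalSample:
    input_ids: List[int]       # prompt (image placeholders expanded) + response
    attention_mask: List[int]  # 1 for real positions, 0 for padding
    loss_mask: List[int]       # 1 only on response text tokens


def build_sample(
    prompt_ids: List[int],
    response_ids: List[int],
    num_image_tokens: int,
    max_length: int,
    pad_id: int = 0,
) -> MultimodalSample:
    """Expand image placeholders, then truncate and pad to max_length."""
    expanded: List[int] = []
    for tok in prompt_ids:
        if tok == IMAGE_TOKEN_ID:
            # One position per visual embedding the target model will insert.
            expanded.extend([IMAGE_TOKEN_ID] * num_image_tokens)
        else:
            expanded.append(tok)

    input_ids = expanded + list(response_ids)
    # Supervise only the response: prompt text and image positions get 0.
    loss_mask = [0] * len(expanded) + [1] * len(response_ids)

    input_ids = input_ids[:max_length]
    loss_mask = loss_mask[:max_length]
    attention_mask = [1] * len(input_ids)

    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad
    loss_mask += [0] * pad
    attention_mask += [0] * pad
    return MultimodalSample(input_ids, attention_mask, loss_mask)


sample = build_sample(
    prompt_ids=[1, IMAGE_TOKEN_ID, 2],
    response_ids=[10, 11],
    num_image_tokens=3,
    max_length=8,
)
print(sample.loss_mask)
```

In this layout the image positions stay in the sequence as context but are excluded from the loss, which matches the goal of predicting future text tokens conditioned on the multimodal context. Pixel values would travel alongside these fields in a real batch; they are left out here to keep the sketch dependency-free.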