[Feature] Support image-text multimodal input for DFlash training, similar to Eagle3 #583

@curnane-lab

Motivation

Currently, the DFlash training pipeline in scripts/train_dflash.py is designed for text-only data. However, many real-world speculative decoding scenarios involve multimodal content (e.g., image + text). Eagle3 has demonstrated support for multimodal inputs, and it would be valuable for SpecForge's DFlash implementation to support image-text inputs as well.

Specifically, we would like to train DFlash draft models for vision-language models such as Qwen3.5/Qwen3.6, where the target model processes both images and text, and the draft model accelerates token generation by predicting future text tokens conditioned on the multimodal context.

Related resources

  • Eagle3 (supports multimodal speculative decoding): https://github.com/NVlabs/Eagle
  • Qwen3.5 model family (potential target models)
  • Existing flexible embedding key in SpecForge: --embedding-key (e.g., model.language_model.embed_tokens.weight)

Proposed scope

  1. Data pipeline: Support loading paired image-text data, either by extending build_eagle3_dataset or through a dedicated DFlash multimodal dataset builder.
  2. Model inputs: Allow the target model to accept pixel values / image embeddings alongside input_ids and attention_mask.
  3. Hidden-state capture: Correctly capture hidden states from the target multimodal language model at the specified target_layer_ids, taking into account vision-language fusion.
  4. Draft model alignment: Ensure the DFlash draft model can consume the multimodal context and produce block-level draft predictions.
  5. Example script & config: Provide an example training script and a reference config for a multimodal DFlash draft model.
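
To make items 1 and 2 more concrete, here is a minimal sketch of one way a multimodal sample could be prepared. It is not SpecForge code. It assumes each image is marked by a single placeholder token in the tokenized text, that the placeholder is expanded into one pad token per image patch, and that the draft loss is computed only on text positions. The names expand_image_placeholders, MultimodalSample, IMAGE_PLACEHOLDER_ID, and IMAGE_PAD_ID, as well as the token id values, are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ids for illustration only; a real tokenizer defines its own.
IMAGE_PLACEHOLDER_ID = -1   # marks "one image goes here" in the raw token stream
IMAGE_PAD_ID = 9999         # stands in for one image patch after expansion


@dataclass
class MultimodalSample:
    input_ids: List[int]
    attention_mask: List[int]
    loss_mask: List[int]     # 1 = text position (trainable), 0 = image position


def expand_image_placeholders(
    token_ids: List[int],
    patches_per_image: List[int],
    placeholder_id: int = IMAGE_PLACEHOLDER_ID,
    pad_id: int = IMAGE_PAD_ID,
) -> MultimodalSample:
    """Expand each image placeholder into per-patch pad tokens.

    Assumption: the draft model is trained to predict text tokens only,
    so every image position gets loss_mask = 0.
    """
    input_ids: List[int] = []
    loss_mask: List[int] = []
    image_index = 0

    for tok in token_ids:
        if tok == placeholder_id:
            if image_index >= len(patches_per_image):
                raise ValueError("more image placeholders than images")
            n_patches = patches_per_image[image_index]
            image_index += 1
            input_ids.extend([pad_id] * n_patches)
            loss_mask.extend([0] * n_patches)
        else:
            input_ids.append(tok)
            loss_mask.append(1)

    if image_index != len(patches_per_image):
        raise ValueError("more images than image placeholders")

    # No padding in this sketch, so every position is attended to.
    attention_mask = [1] * len(input_ids)
    return MultimodalSample(input_ids, attention_mask, loss_mask)


# Usage: one image with 3 patches between text tokens 5 and 7, 8.
sample = expand_image_placeholders([5, IMAGE_PLACEHOLDER_ID, 7, 8], [3])
print(sample.input_ids)   # [5, 9999, 9999, 9999, 7, 8]
print(sample.loss_mask)   # [1, 0, 0, 0, 1, 1]
```

In a real pipeline the pixel values would travel alongside these fields, and the loss mask would likely be narrowed further (for example, to assistant turns only). Whether image positions should also be excluded from the hidden states fed to the draft model is an open design question for item 4.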
