[Feature] Support image-text multimodal input for DFlash training, similar to Eagle3

### Checklist

- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose instead; otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.

### Motivation

Currently, the DFlash training pipeline in `scripts/train_dflash.py` is designed for text-only data. However, many real-world speculative decoding scenarios involve multimodal content (e.g., image + text). Eagle3 has demonstrated support for multimodal inputs, and it would be valuable for SpecForge's DFlash implementation to support image-text inputs too.

Specifically, we would like to train DFlash draft models for vision-language models such as Qwen3.5/Qwen3.6, where the target model processes both images and text, and the draft model accelerates token generation by predicting future text tokens conditioned on the multimodal context.

### Related resources

- Eagle3 (supports multimodal speculative decoding): https://github.com/NVlabs/Eagle
- Qwen3.5 model family (potential target models)
- Existing flexible embedding key in SpecForge: `--embedding-key` (e.g., `model.language_model.embed_tokens.weight`)
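
As a rough illustration of why a configurable embedding key matters here: text-only checkpoints often store the token embedding under `model.embed_tokens.weight`, while the key quoted above nests it under `language_model`. The helper below is a hypothetical sketch (the function name and the candidate list are assumptions, not SpecForge code) that picks the first candidate key present in a checkpoint's key set.

```python
from typing import Iterable, Optional, Sequence

# Illustrative candidates only; verify against the actual checkpoint.
CANDIDATE_EMBEDDING_KEYS = (
    "model.embed_tokens.weight",                 # common text-only layout
    "model.language_model.embed_tokens.weight",  # layout quoted in this issue
)


def find_embedding_key(
    state_dict_keys: Iterable[str],
    candidates: Sequence[str] = CANDIDATE_EMBEDDING_KEYS,
) -> Optional[str]:
    """Return the first candidate key found in the checkpoint, else None."""
    available = set(state_dict_keys)
    for key in candidates:
        if key in available:
            return key
    return None


if __name__ == "__main__":
    # Hypothetical key listing for a vision-language checkpoint.
    vlm_keys = [
        "model.language_model.embed_tokens.weight",
        "lm_head.weight",
    ]
    print(find_embedding_key(vlm_keys))
```

The value returned by such a lookup is what would be passed to `--embedding-key`.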

### Proposed scope

1. **Data pipeline**: Support loading image-text paired data in `build_eagle3_dataset` or a dedicated DFlash multimodal dataset builder.
2. **Model inputs**: Allow the target model to accept pixel values / image embeddings alongside `input_ids` and `attention_mask`.
3. **Hidden-state capture**: Correctly capture hidden states from the target multimodal language model at the specified `target_layer_ids`, taking into account vision-language fusion.
4. **Draft model alignment**: Ensure the DFlash draft model can consume the multimodal context and produce block-level draft predictions.
5. **Example script & config**: Provide an example training script and a reference config for a multimodal DFlash draft model.
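
One way to read items 3 and 4 together, assuming the target model represents an image as a run of placeholder tokens in `input_ids` (as several vision-language models do): the captured hidden states then cover image and text positions alike, while the draft model should only be supervised on text tokens of the response. The sketch below is a minimal, framework-free illustration of that masking step. `build_loss_mask`, `IMAGE_TOKEN_ID`, and `assistant_mask` are hypothetical names, not SpecForge APIs, and the placeholder id is made up.

```python
from typing import List, Sequence

# Hypothetical placeholder id; a real tokenizer defines its own image token id.
IMAGE_TOKEN_ID = -200


def build_loss_mask(
    input_ids: Sequence[int],
    assistant_mask: Sequence[int],
    image_token_id: int = IMAGE_TOKEN_ID,
) -> List[int]:
    """Return 1 where the draft model is supervised, else 0.

    A position is supervised only if it belongs to the assistant response
    and is not an image placeholder token.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have equal length")
    return [
        int(bool(is_assistant) and token != image_token_id)
        for token, is_assistant in zip(input_ids, assistant_mask)
    ]


if __name__ == "__main__":
    # Prompt: BOS, two image placeholders, one text token. Response: two tokens.
    ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 15, 30, 31]
    assistant = [0, 0, 0, 0, 1, 1]
    print(build_loss_mask(ids, assistant))
```

Keeping the mask the same length as `input_ids` means it lines up position by position with the hidden states captured at `target_layer_ids`, so image positions can still serve as context without contributing to the loss.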


[Feature] Support image-text multimodal input for DFlash training, similar to Eagle3 #583
