
Clarification on visual token definition and baseline alignment in Table 1 #1

Description

@cwdwc

Great work! I have a quick question regarding the token definition in Table 1 to ensure a fair comparison:

1. Your method: Does the token count (64/128/192) refer to the average number of tokens propagated per layer, or strictly to keep_tokens after the pruning layer? (For example, prune_layer=3 with keep_tokens=192 gives an average of ~216 tokens per layer across LLaVA-1.5.)

2. Baselines: FastV, PDrop, and SparseVLM each use a different reduction mechanism. For instance, SparseVLM achieves its 1505 MME score with an average of 64 tokens per layer via progressive pruning (layers [2, 6, 15] with budgets [66, 30, 17]). Were these baselines aligned on the average layer-wise token cost, or on their respective configuration budgets?
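
For reference, this is roughly how I am computing the "average tokens per layer" figures above. It is only a sketch under my own assumptions, not taken from your code: 576 visual tokens and 32 decoder layers for LLaVA-1.5, 0-indexed layers, and every layer from a pruning layer onward (until the next stage) processing that stage's budget. The function name `average_tokens_per_layer` is hypothetical, and the count covers only the tokens kept at each stage.

```python
def average_tokens_per_layer(prune_layers, budgets, total_tokens=576, num_layers=32):
    """Average number of visual tokens processed per decoder layer.

    Assumptions (mine, not from the paper or repo):
    - layers are 0-indexed;
    - layers before prune_layers[0] see all `total_tokens`;
    - layer prune_layers[i] up to the next pruning layer sees budgets[i].
    """
    if len(prune_layers) != len(budgets):
        raise ValueError("prune_layers and budgets must have the same length")

    # Stage boundaries: [0, p1, p2, ..., num_layers]
    boundaries = [0, *prune_layers, num_layers]
    # Token count used inside each stage, starting with the unpruned stage.
    counts = [total_tokens, *budgets]

    total = sum(
        count * (end - start)
        for count, start, end in zip(counts, boundaries, boundaries[1:])
    )
    return total / num_layers


# Single-stage pruning where the first two layers see all 576 tokens.
print(average_tokens_per_layer([2], [192]))  # 216.0

# Progressive schedule quoted above for SparseVLM.
print(average_tokens_per_layer([2, 6, 15], [66, 30, 17]))  # 61.71875
```

The ~216 figure matches only if the first two layers see all 576 tokens, which is how I read prune_layer=3 (counted from 1). If three layers run unpruned, the same formula gives 228 instead, so the indexing convention matters for the comparison.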

Looking forward to your clarification. Thanks!
