Clarification on visual token definition and baseline alignment in Table 1

Great work! I have two quick questions about how token counts are defined in Table 1, to make sure the comparison is fair:

**1. Your method:** Does the token count (64/128/192) refer to the average number of visual tokens propagated per layer, or strictly to keep_tokens after the pruning layer? (For example, prune_layer=3, keep_tokens=192 would correspond to an average of ~216 tokens per layer across all of LLaVA-1.5's layers.)

**2. Baselines:** The reduction mechanisms of baselines such as FastV, PDrop, and SparseVLM differ. For instance, SparseVLM achieves its 1505 MME score with an average of 64 tokens per layer via progressive pruning (layers [2, 6, 15] with budgets [66, 30, 17]). Were these baselines aligned on the average layer-wise token cost, or on their respective configured budgets?
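
For reference, this is roughly how I am computing the layer-wise average. It is only a sketch under my own assumptions, not taken from your code or from any baseline's implementation: I assume LLaVA-1.5-7B with 576 visual tokens and 32 decoder layers, 0-indexed layers, and that a stage listed at layer k takes effect from layer k onward (so layers before the first stage see all tokens). The function name `average_visual_tokens` and its parameters are hypothetical.

```python
def average_visual_tokens(total_tokens, num_layers, prune_layers, budgets):
    """Average number of visual tokens seen per decoder layer.

    Assumption: layers [0, prune_layers[0]) see `total_tokens`; from
    prune_layers[i] up to the next stage (or the last layer), each layer
    sees budgets[i] tokens.
    """
    if len(prune_layers) != len(budgets):
        raise ValueError("prune_layers and budgets must have the same length")

    # Stage boundaries, e.g. [0, 2, 6, 15, 32] for three pruning stages.
    boundaries = [0] + list(prune_layers) + [num_layers]
    # Token count in each stage, starting with the unpruned count.
    counts = [total_tokens] + list(budgets)

    total = 0
    for i, count in enumerate(counts):
        layers_in_stage = boundaries[i + 1] - boundaries[i]
        total += count * layers_in_stage
    return total / num_layers


# Single-stage pruning, two full-width layers before the cut.
print(average_visual_tokens(576, 32, [2], [192]))   # 216.0
# Single-stage pruning, three full-width layers before the cut.
print(average_visual_tokens(576, 32, [3], [192]))   # 228.0
# Progressive schedule quoted above for SparseVLM.
print(average_visual_tokens(576, 32, [2, 6, 15], [66, 30, 17]))  # 61.71875
```

Under these assumptions, the ~216 figure corresponds to two full-width layers before pruning, while three full-width layers would give 228, so the indexing convention for prune_layer affects the answer. The progressive schedule comes out at about 61.7, close to the nominal 64.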

Looking forward to your clarification. Thanks!
