Question about Cross-modal Query

Hello, and thanks for the nice work.
When I ran similar experiments on LLaVA-OV, I found that the method could not find the frames of interest. The relevant code is here:
https://github.com/Vision-CAIR/LongVU/blob/1ca42869fd456ecfef8acdc2aaa01e43864431e0/longvu/cambrian_arch.py#L1411-L1418

When I printed the norms of the visual and text features, I found that the visual feature norms vary over a wide range:
```
Visual features norm: tensor(18.2188, device='cuda:0', dtype=torch.float16) tensor(296.5000, device='cuda:0', dtype=torch.float16) tensor(75.7500, device='cuda:0', dtype=torch.float16)
Text features norm: tensor(0.2739, device='cuda:0', dtype=torch.float16) tensor(0.8486, device='cuda:0', dtype=torch.float16) tensor(0.7085, device='cuda:0', dtype=torch.float16)
```
This means that a frame whose features have larger values gets a larger cross-modal attention score than the other frames, so that frame is chosen. But the magnitude of the features mostly depends on the vision_tower, not on how relevant the frame is to the query.
So I think **it might not be proper to compare** the raw cross-modal attention scores across frames. I hope I am not missing something important. What do you think? I would be glad to hear from you.
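
To illustrate the concern, here is a minimal toy sketch. It is not LongVU code: the tensors are made up, and the L2-normalized (cosine) variant is only one possible alternative that I am assuming for comparison, not something the repository does.

```python
import torch
import torch.nn.functional as F

# Made-up toy data: one text token, two frames with two visual tokens each.
text = torch.tensor([[1.0, 0.0]])       # (num_text_tokens, dim)
frames = torch.tensor([
    [[1.0, 0.0], [1.0, 0.0]],           # frame 0: same direction as text, norm 1
    [[30.0, 40.0], [30.0, 40.0]],       # frame 1: different direction, norm 50
])                                      # (num_frames, tokens_per_frame, dim)

# Raw dot product, averaged over text tokens and visual tokens per frame.
raw = torch.matmul(frames, text.transpose(0, 1)).mean(dim=(1, 2))

# Same score after L2-normalizing both sides (cosine similarity).
cos = torch.matmul(
    F.normalize(frames, dim=-1),
    F.normalize(text, dim=-1).transpose(0, 1),
).mean(dim=(1, 2))

print(raw)  # frame 1 scores higher (30.0 vs 1.0) because of its larger norm
print(cos)  # frame 0 scores higher (1.0 vs 0.6) once magnitude is removed
```

In this toy case the raw score ranks the high-norm frame first even though the other frame points in exactly the same direction as the text feature.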

Best regards.

```python
sim = torch.matmul(visual_emb, text_emb.transpose(0, 1)).mean(
    dim=-1
)
sim_frame = sim.reshape(
    frame_split_sizes[cur_image_idx], -1
).mean(dim=-1)
highres_num = min(highres_num, sim_frame.shape[0])
top_values, top_indices = torch.topk(sim_frame, highres_num)
```

Question about Cross-modal Query #48
