Question about Cross-modal Query #48

@cokeshao

Hello, thanks for your nice work.
When running similar experiments on LLaVA-OV, I found that the cross-modal query could not find the frames of interest.

LongVU/longvu/cambrian_arch.py

Lines 1411 to 1418 in 1ca4286

sim = torch.matmul(visual_emb, text_emb.transpose(0, 1)).mean(
    dim=-1
)
sim_frame = sim.reshape(
    frame_split_sizes[cur_image_idx], -1
).mean(dim=-1)
highres_num = min(highres_num, sim_frame.shape[0])
top_values, top_indices = torch.topk(sim_frame, highres_num)

When I printed the norms of the visual features, I found that they vary over a wide range.

Visual features norm: tensor(18.2188, device='cuda:0', dtype=torch.float16) tensor(296.5000, device='cuda:0', dtype=torch.float16) tensor(75.7500, device='cuda:0', dtype=torch.float16)
Text features norm: tensor(0.2739, device='cuda:0', dtype=torch.float16) tensor(0.8486, device='cuda:0', dtype=torch.float16) tensor(0.7085, device='cuda:0', dtype=torch.float16)

This means that a frame whose features have larger magnitudes gets a larger cross-modal attention score than other frames, so that frame is selected. But the magnitude mostly depends on the vision_tower, not on how relevant the frame is to the query.
So I think it may not be appropriate to compare the cross-modal attention scores across frames. I hope I am not missing something important. What do you think? I would be glad to hear from you.
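
To make the concern concrete, here is a minimal sketch on toy tensors. It is not LongVU's code: the helper `frame_scores`, the hand-built embeddings, and the use of L2 normalization (cosine similarity) as a comparison are all assumptions for illustration. Frame 0 has a large norm but is only weakly aligned with the text, while frame 2 has unit norm and is fully aligned.

```python
import torch
import torch.nn.functional as F


def frame_scores(visual_emb, text_emb, num_frames, normalize=False):
    """Hypothetical helper: mean visual-text similarity per frame.

    visual_emb: (num_frames * tokens_per_frame, dim)
    text_emb:   (num_text_tokens, dim)
    """
    if normalize:
        # Assumption: L2-normalize both sides so the score is a cosine
        # similarity and no longer depends on feature magnitude.
        visual_emb = F.normalize(visual_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
    # Same reduction as the quoted snippet: mean over text tokens,
    # then mean over the tokens of each frame.
    sim = torch.matmul(visual_emb, text_emb.transpose(0, 1)).mean(dim=-1)
    return sim.reshape(num_frames, -1).mean(dim=-1)


# Toy data (made up): the text points along e0 with a small norm.
e0 = torch.tensor([1.0, 0.0, 0.0, 0.0])
e1 = torch.tensor([0.0, 1.0, 0.0, 0.0])
text_emb = (0.5 * e0).repeat(3, 1)  # 3 text tokens

tokens_per_frame = 2
frames = [
    100.0 * (0.1 * e0 + 0.995 * e1),  # frame 0: huge norm, weakly aligned
    e1,                               # frame 1: unrelated
    e0,                               # frame 2: unit norm, fully aligned
    e1,                               # frame 3: unrelated
]
visual_emb = torch.stack([f for f in frames for _ in range(tokens_per_frame)])

raw_scores = frame_scores(visual_emb, text_emb, num_frames=len(frames))
cos_scores = frame_scores(
    visual_emb, text_emb, num_frames=len(frames), normalize=True
)

print("raw top frame:   ", int(raw_scores.argmax()))
print("cosine top frame:", int(cos_scores.argmax()))
```

In this toy case the unnormalized score ranks the large-norm frame first, while the normalized score ranks the aligned frame first. Whether normalization is the right fix for LongVU or LLaVA-OV features is an open question here, not a tested result.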

Best regards.
