Hello, thanks for your nice work.
When conducting similar experiments on LLaVA-OV, I found that the method couldn't find the frames of interest.
LongVU/longvu/cambrian_arch.py, lines 1411 to 1418 in 1ca4286:

    sim = torch.matmul(visual_emb, text_emb.transpose(0, 1)).mean(
        dim=-1
    )
    sim_frame = sim.reshape(
        frame_split_sizes[cur_image_idx], -1
    ).mean(dim=-1)
    highres_num = min(highres_num, sim_frame.shape[0])
    top_values, top_indices = torch.topk(sim_frame, highres_num)

When I print the norms of the visual and text features, I find that the visual feature norms vary widely, while the text feature norms all stay below 1:
Visual features norm: tensor(18.2188, device='cuda:0', dtype=torch.float16) tensor(296.5000, device='cuda:0', dtype=torch.float16) tensor(75.7500, device='cuda:0', dtype=torch.float16)
Text features norm: tensor(0.2739, device='cuda:0', dtype=torch.float16) tensor(0.8486, device='cuda:0', dtype=torch.float16) tensor(0.7085, device='cuda:0', dtype=torch.float16)
That means a frame whose features have larger magnitudes gets a larger cross-modal attention score than the other frames, so that frame is the one selected. But the magnitude mostly depends on the vision_tower.
So I think it might not be appropriate to compare the raw cross-modal attention scores across frames. I hope I'm not missing something important. What do you think? I would be glad to hear from you.
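
To make the concern concrete, here is a minimal sketch with synthetic tensors. It is not the LongVU code path: `score_frames` and its `normalize` flag are names I made up for illustration, and the norms (18 and 300) are only chosen to resemble the values printed above. It reproduces the scoring from the snippet and, as one possible mitigation that I have not validated on real data, L2-normalizes both embeddings first so the score becomes a mean cosine similarity.

```python
import torch
import torch.nn.functional as F


def score_frames(visual_emb, text_emb, num_frames, normalize=False):
    """Per-frame score: mean dot product between visual and text tokens.

    visual_emb: (num_frames * tokens_per_frame, dim)
    text_emb:   (num_text_tokens, dim)
    normalize:  hypothetical flag; if True, L2-normalize both inputs first,
                which turns the dot product into a cosine similarity.
    """
    if normalize:
        visual_emb = F.normalize(visual_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
    sim = torch.matmul(visual_emb, text_emb.transpose(0, 1)).mean(dim=-1)
    return sim.reshape(num_frames, -1).mean(dim=-1)


dim, tokens_per_frame = 8, 4

# Synthetic text tokens: small norm (0.5), all pointing along axis 0.
text_dir = torch.zeros(dim)
text_dir[0] = 1.0
text_emb = 0.5 * text_dir.repeat(3, 1)

# Frame 0: perfectly aligned with the text direction, small norm (18).
frame0 = 18.0 * text_dir.repeat(tokens_per_frame, 1)

# Frame 1: weakly aligned (cosine 0.2 with the text), large norm (300).
off_dir = torch.zeros(dim)
off_dir[0] = 0.2
off_dir[1] = (1.0 - 0.2 ** 2) ** 0.5
frame1 = 300.0 * off_dir.repeat(tokens_per_frame, 1)

visual_emb = torch.cat([frame0, frame1], dim=0)

raw = score_frames(visual_emb, text_emb, num_frames=2)
cos = score_frames(visual_emb, text_emb, num_frames=2, normalize=True)

print("raw scores:", raw, "-> selected frame", raw.argmax().item())
print("cosine scores:", cos, "-> selected frame", cos.argmax().item())
```

In this toy case the raw score selects frame 1 purely because of its larger norm, while the normalized score selects frame 0, the frame that is actually aligned with the text.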
Best regards.