Reproducing Qwen2.5-3B performance #5

Hi there,

I just tried running the LV-Eval 128k subset on the Qwen2.5-3B-Instruct model (AHN not enabled). The scores are far below the numbers reported in the paper.

A screenshot of my results is below. I ran the evaluation on a single A6000 Ada (48 GB VRAM) in single-process mode. Could you please share the model predictions (the JSON files) so I can work out what is going wrong?


<img width="426" height="492" alt="Image" src="https://github.com/user-attachments/assets/48a9cc18-8300-44f1-8be1-5f663323807b" />
