Checklist
Describe the bug
When using pre-formatted data, I found that most of the data was skipped during training, resulting in batch sizes 1024 times smaller. The culprit turned out to be the different tools being passed during construction of the eagle dataset.
Reproduction
In specforge/data/preprocessing.py, line 378, add:
tools=[[] for _ in range(len(examples["text"]))],
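To show what the added argument evaluates to, here is a minimal standalone sketch. The helper name empty_tools_for_batch and the sample batch are hypothetical and are not part of specforge; the only assumption taken from the line above is that examples is a dict with a "text" list.

```python
# Hypothetical helper, for illustration only; not part of specforge.
def empty_tools_for_batch(examples):
    """Return one empty tools list per pre-formatted text in the batch."""
    return [[] for _ in range(len(examples["text"]))]


# Made-up batch with the same shape as the `examples` dict referenced above.
examples = {"text": ["first pre-formatted sample", "second pre-formatted sample"]}
tools = empty_tools_for_batch(examples)

# One entry per example, and each entry is its own list object.
print(len(tools) == len(examples["text"]))
```

The list comprehension creates a separate empty list for each example, whereas [[]] * n would repeat the same list object n times.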
Environment
This is a logic error and should be reproducible on the main branch with pre-formatted data that uses the build_eagle3_dataset function.