Data Mixture Modification Script #74

Open
alexdremov wants to merge 2 commits into swiss-ai:main from alexdremov:main

alexdremov commented May 27, 2025

This script modifies the metadata of a blended dataset. It can:

  • incorporate new datasets
  • remove datasets that are currently in the mixture
  • remove tokens that have already been seen

The processing procedure file is also committed in this PR (https://github.com/swiss-ai/Megatron-LM/pull/74/files#diff-d4fed1e9afb714170efaffdf51f31de4ffa2419a7fd029508fa496fb2ebfa7ba).
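
To make the three operations concrete, here is a minimal sketch of how they could act on a mixture. It is not the code in this PR: the `Mixture` type (dataset prefix mapped to remaining tokens), the helper names, and the dataset names are all hypothetical, and it assumes mixing weights are simply proportional to the remaining token counts.

```python
from typing import Dict

# Hypothetical representation: dataset prefix -> tokens still to be consumed.
Mixture = Dict[str, float]


def mixing_weights(mixture: Mixture) -> Dict[str, float]:
    """Normalize remaining token counts into weights that sum to 1."""
    total = sum(mixture.values())
    if total <= 0:
        raise ValueError("mixture has no tokens left")
    return {name: tokens / total for name, tokens in mixture.items()}


def add_dataset(mixture: Mixture, name: str, tokens: float) -> Mixture:
    """Return a copy of the mixture with a new dataset incorporated."""
    if name in mixture:
        raise KeyError(f"dataset {name!r} is already in the mixture")
    updated = dict(mixture)
    updated[name] = tokens
    return updated


def remove_dataset(mixture: Mixture, name: str) -> Mixture:
    """Return a copy of the mixture without the given dataset."""
    if name not in mixture:
        raise KeyError(f"dataset {name!r} is not in the mixture")
    return {n: t for n, t in mixture.items() if n != name}


def remove_seen_tokens(mixture: Mixture, seen: Dict[str, float]) -> Mixture:
    """Subtract per-dataset consumed tokens, clamping at zero."""
    return {n: max(t - seen.get(n, 0.0), 0.0) for n, t in mixture.items()}


# Illustrative numbers only.
mixture = {"web": 600.0, "code": 300.0, "math": 100.0}
mixture = remove_seen_tokens(mixture, {"web": 100.0, "code": 100.0})
mixture = remove_dataset(mixture, "math")
mixture = add_dataset(mixture, "wiki", 300.0)
weights = mixing_weights(mixture)
```

Each helper returns a new dictionary instead of mutating its input, so the metadata from before and after a modification can be compared directly.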


unwrapped_new_datasets = create_data_prefix([dataset])
new_megatron_datasets = [
    self.create_megatron_dataset(i) for i in unwrapped_new_datasets
]


The only reason for creating the Megatron dataset here is to determine the number of samples. This is kind of ugly, but I could not think of another solution.

alexdremov commented May 27, 2025

  • Verify that the current mixing weight calculation is consistent with the main code. This can be done by removing a dataset and then adding the same dataset back: the resulting weights should match the original ones.
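
The round-trip check could be sketched roughly as follows. Everything here is a stand-in: `normalize` plays the role of the main code's weight calculation, the remove and re-add steps are done with plain dictionaries instead of the script, and the dataset names, token counts, and tolerance are illustrative assumptions.

```python
import math
from typing import Dict


def normalize(token_counts: Dict[str, float]) -> Dict[str, float]:
    """Stand-in for the reference weight calculation (assumed proportional)."""
    total = sum(token_counts.values())
    return {name: count / total for name, count in token_counts.items()}


def weights_match(
    reference: Dict[str, float],
    candidate: Dict[str, float],
    rel_tol: float = 1e-6,
) -> bool:
    """True if both mixtures hold the same datasets with near-equal weights."""
    if reference.keys() != candidate.keys():
        return False
    return all(
        math.isclose(reference[name], candidate[name], rel_tol=rel_tol)
        for name in reference
    )


# Illustrative numbers only.
token_counts = {"web": 600.0, "code": 300.0, "math": 100.0}
reference = normalize(token_counts)

# Round trip: remove "web", then add it back with the same token count.
without_web = {n: c for n, c in token_counts.items() if n != "web"}
restored = dict(without_web)
restored["web"] = token_counts["web"]
candidate = normalize(restored)

roundtrip_ok = weights_match(reference, candidate)
```

In a real check, `reference` would come from the main code and `candidate` from the script's output after the remove-then-add round trip. Any mismatch beyond the tolerance would indicate that the two weight calculations disagree.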
