Add Tokenizer Comparison Script #150

Open: pandora-s-git wants to merge 6 commits into mistralai:main from pandora-s-git:patch-3

@pandora-s-git

Contributor

This script compares the basic `.encode` tokenization between Hugging Face and Mistral Common tokenizers across multiple datasets.
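
To make the idea concrete, here is a minimal sketch of what such a comparison loop might look like. It is not the code from `scripts/compare_tokenizer.py`; the names `first_divergence`, `compare_encodings`, `toy_a`, and `toy_b` are hypothetical, and the two toy encoders stand in for whatever wrappers would call the Hugging Face and Mistral Common tokenizers.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# An encoder is any callable mapping text to token ids (assumption: both
# tokenizers can be wrapped this way, with special tokens handled identically).
Encoder = Callable[[str], List[int]]


def first_divergence(a: List[int], b: List[int]) -> Optional[int]:
    """Return the index of the first differing token id, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_encodings(
    encode_a: Encoder, encode_b: Encoder, texts: Iterable[str]
) -> List[Tuple[int, int]]:
    """Return (sample index, first differing position) for each mismatch."""
    mismatches = []
    for idx, text in enumerate(texts):
        pos = first_divergence(encode_a(text), encode_b(text))
        if pos is not None:
            mismatches.append((idx, pos))
    return mismatches


# Hypothetical stand-ins for the two real tokenizers.
def toy_a(text: str) -> List[int]:
    return [ord(c) for c in text]


def toy_b(text: str) -> List[int]:
    return [ord(c) for c in text.lower()]


samples = ["hello", "Hello", "abc"]
print(compare_encodings(toy_a, toy_b, samples))  # [(1, 0)]
```

Reporting the first differing position, rather than a plain pass/fail, tends to make it easier to see where two tokenizers start to disagree on a given sample.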

@juliendenize juliendenize left a comment

Contributor

Thanks a bunch for these scripts, I left a few comments. For general feedback and discussion:

  • Can you run pre-commit to silence the failing checks and get rid of the linter/formatter issues?
  • As we also plan to host chat templates, wdyt about creating a folder to host scripts and files dedicated to library integrations?

Comment thread scripts/compare_tokenizer.py Outdated
@pandora-s-git

pandora-s-git commented Nov 12, 2025

Contributor Author

> As we plan to also host chat templates, wdyt about creating a folder to host scripts, files dedicated to integrations in libraries?

I've currently added a `scripts` folder. Would you like to have subfolders for each integration?

Also, should I include more test cases in this script, such as basic instruct datasets, function calling, etc.?

@juliendenize

Contributor

> I've currently added a scripts folder, would you like to have subfolders for each integration?

Yeah, actually I was thinking the other way around: a parent folder named `integration` or `external` (if you come up with a better name, don't hesitate) that would contain `scripts` as a subfolder alongside `chat_templates`.

> Also, should I include in this script more test cases like basic instruct datasets, function calling, etc?

Yes, it would be nice to have multimodal (audio, image), function calling, instruct, and reasoning test cases. If you need help, lmk :)

2 participants