Add Tokenizer Comparison Script#150
This script compares the basic `.encode` tokenization between Hugging Face and Mistral Common tokenizers across multiple datasets.
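A minimal sketch of the kind of comparison described above, not the script from this PR. `compare_encoders`, `byte_encoder`, and `codepoint_encoder` are hypothetical names; the two toy encoders only stand in for the Hugging Face and Mistral Common tokenizers so the example runs without downloading anything.

```python
from typing import Callable, Iterable

# An encoder is anything that maps a string to a list of token ids.
Encoder = Callable[[str], list[int]]


def compare_encoders(
    texts: Iterable[str], encode_a: Encoder, encode_b: Encoder
) -> list[tuple[int, str, list[int], list[int]]]:
    """Return (index, text, ids_a, ids_b) for every text where the encoders disagree."""
    mismatches = []
    for i, text in enumerate(texts):
        ids_a = list(encode_a(text))
        ids_b = list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((i, text, ids_a, ids_b))
    return mismatches


# Hypothetical stand-ins for the two real tokenizers (assumption, illustration only).
def byte_encoder(text: str) -> list[int]:
    return list(text.encode("utf-8"))


def codepoint_encoder(text: str) -> list[int]:
    return [ord(ch) for ch in text]


samples = ["hello", "café", ""]
mismatches = compare_encoders(samples, byte_encoder, codepoint_encoder)
print(f"{len(mismatches)} of {len(samples)} samples differ")  # prints "1 of 3 samples differ"
```

With the real libraries, each stand-in would presumably be replaced by a thin wrapper around the corresponding tokenizer's `.encode` call, with special tokens (BOS/EOS) disabled on both sides so that only the raw tokenization is compared.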
juliendenize
left a comment
Thanks a bunch for these scripts, I left a few comments. For general feedback and discussion:
- Can you run pre-commit to fix the failing checks and get rid of the linter/formatter issues?
- As we also plan to host chat templates, wdyt about creating a folder to host the scripts and files dedicated to library integrations?
I've currently added a scripts folder. Would you like subfolders for each integration? Also, should I include more test cases in this script, like basic instruct datasets, function calling, etc.?
Yeah actually I was thinking the other way around having a parent folder named
Yes, it would be nice to have multimodal (audio, image), function calling, instruct, and reasoning. If you need help, lmk :)