This repository contains a small, educational byte-level BPE-style tokenizer implemented in tokenizer.py.
Features
- Train simple merge rules from a text string using byte-level tokens (0-255).
- Encode text into token ids and decode back to UTF-8 text.
Improvements made
- Replaced the original tangled script with a clean, well-documented module.
- Added type hints, docstrings, and clearer function names (
get_pairs,merge_ids,train,encode,decode). - Fixed variable/name inconsistencies and deterministic merge application order.
Quick start
- Run the example:
python3 tokenizer.py- Use the API from Python:
from tokenizer import train, encode, decode
text = "Hello world"
ids, merges, merges_rev = train(text, steps=50)
encoded = encode(text, merges)
decoded = decode(encoded, merges_rev)
assert decoded == textNotes
- This is a tiny illustrative implementation and is not optimized for production use.
- For robust tokenization consider existing libraries (e.g., Hugging Face Tokenizers).
License
- No license provided. Use as you like for experiments and learning.