SurajAdhikari01/BasicTokenizerForLLM

Tiny Byte-Level BPE Tokenizer

This repository contains a small, educational byte-level BPE-style tokenizer implemented in tokenizer.py.

Features

  • Train simple merge rules from a text string using byte-level tokens (0-255).
  • Encode text into token ids and decode back to UTF-8 text.
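As an illustration of the training feature, here is a minimal sketch of byte-level BPE training. It is not the code in tokenizer.py: the helper names get_pairs and merge_ids are borrowed from this README, but their bodies, the name train_sketch, and the shapes of the returned dictionaries are assumptions made for the example.

```python
from collections import Counter


def get_pairs(ids):
    """Count every adjacent pair of token ids (overlapping pairs included)."""
    return Counter(zip(ids, ids[1:]))


def merge_ids(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair`, left to right, with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_sketch(text, steps):
    """Learn up to `steps` merges; new token ids start at 256.

    Assumed return shape: (ids, merges, merges_rev), where
    merges maps (left, right) -> new_id and merges_rev maps new_id -> (left, right).
    """
    ids = list(text.encode("utf-8"))  # raw bytes are the base tokens 0-255
    merges, merges_rev = {}, {}
    for step in range(steps):
        pairs = get_pairs(ids)
        if not pairs:  # fewer than two tokens left, nothing to merge
            break
        # max() returns the first most-frequent pair in insertion order,
        # so ties are broken the same way on every run.
        pair = max(pairs, key=pairs.get)
        new_id = 256 + step
        ids = merge_ids(ids, pair, new_id)
        merges[pair] = new_id
        merges_rev[new_id] = pair
    return ids, merges, merges_rev


ids, merges, merges_rev = train_sketch("aaabdaaabac", steps=3)
print(ids)     # [258, 100, 258, 97, 99]
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Each step merges the currently most frequent pair, so the 11 input bytes shrink to 5 tokens after three merges.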

Improvements made

  • Replaced the original tangled script with a clean, well-documented module.
  • Added type hints, docstrings, and clearer function names (get_pairs, merge_ids, train, encode, decode).
  • Fixed variable and name inconsistencies and made the merge application order deterministic.
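One common way to make merge application deterministic is to always apply the earliest-learned merge that is present in the sequence. The sketch below shows that approach under stated assumptions: encode_sketch is a hypothetical name, merges is assumed to map (left, right) pairs to new ids that increase in learning order, and the repository's own encode may work differently.

```python
def encode_sketch(text, merges):
    """Encode text by repeatedly applying the lowest-numbered applicable merge.

    Assumes `merges` maps (left, right) -> new_id, with smaller ids learned earlier.
    """
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # Among the adjacent pairs present, pick the one learned first.
        pair = min(zip(ids, ids[1:]), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no remaining pair has a merge rule
        new_id = merges[pair]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


# Hand-written merge table for the example (hypothetical values).
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
# Same rules inserted in a different order: the result does not change.
shuffled = {(257, 98): 258, (97, 97): 256, (256, 97): 257}

print(encode_sketch("aaab", merges))    # [258]
print(encode_sketch("aaab", shuffled))  # [258]
```

Because the choice depends only on the merge ids and not on dictionary iteration order, the same text and rules always produce the same token ids.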

Quick start

  1. Run the example:
python3 tokenizer.py
  2. Use the API from Python:
from tokenizer import train, encode, decode

text = "Hello world"
ids, merges, merges_rev = train(text, steps=50)
encoded = encode(text, merges)
decoded = decode(encoded, merges_rev)
assert decoded == text
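The round trip above relies on decode expanding merged ids back into raw bytes before interpreting them as UTF-8. A minimal sketch of that idea follows. It assumes merges_rev maps each new id to the (left, right) pair it replaced; that shape, the name decode_sketch, and the choice of errors="replace" are assumptions, not taken from tokenizer.py.

```python
def decode_sketch(ids, merges_rev):
    """Expand merged ids back to bytes, then decode the bytes as UTF-8.

    Assumes `merges_rev` maps new_id -> (left, right) and that ids below 256 are raw bytes.
    """
    stack = list(reversed(ids))  # pop() then yields tokens left to right
    out = bytearray()
    while stack:
        tok = stack.pop()
        if tok < 256:
            out.append(tok)
        else:
            left, right = merges_rev[tok]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
    # errors="replace" keeps decoding from raising if the ids end mid-character.
    return out.decode("utf-8", errors="replace")


# Hand-written reverse table for the example (hypothetical values).
merges_rev = {256: (97, 97), 257: (256, 97), 258: (257, 98)}
print(decode_sketch([258, 100, 258, 97, 99], merges_rev))  # aaabdaaabac

# A multi-byte character survives because merging happens on bytes, not characters.
print(decode_sketch([256], {256: (195, 169)}))  # é
```

The explicit stack avoids recursion, so deeply nested merges do not run into Python's recursion limit.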

Notes

  • This is a tiny illustrative implementation and is not optimized for production use.
  • For robust tokenization consider existing libraries (e.g., Hugging Face Tokenizers).

License

  • No license provided. Use as you like for experiments and learning.
