Skip to content

fix: correct BPE merge annotations in test_wikipedia_example#111

Open
coding-shalabh wants to merge 1 commit into
karpathy:masterfrom
coding-shalabh:fix/test-annotation-issue-93
Open

fix: correct BPE merge annotations in test_wikipedia_example#111
coding-shalabh wants to merge 1 commit into
karpathy:masterfrom
coding-shalabh:fix/test-annotation-issue-93

Conversation

@coding-shalabh

Copy link
Copy Markdown

Summary

  • Fixed incorrect BPE merge sequence annotations in test_wikipedia_example docstring
  • The docstring claimed X=ZY, Y=ab, Z=aa but the actual BPE algorithm produces Z=aa, Y=Za (aaa), X=Yb (aaab)
  • Added explicit merge sequence with occurrence counts for clarity

Details

The original annotation described the merges as:

X=ZY
Y=ab
Z=aa

But running BPE on "aaabdaaabac" for 3 merges actually produces:

1. (97, 97) -> 256 (Z=aa), most frequent pair (4 occurrences)
2. (256, 97) -> 257 (Y=Za=aaa), most frequent pair (2 occurrences)
3. (257, 98) -> 258 (X=Yb=aaab), most frequent pair (2 occurrences)

This can be verified with:

tokenizer = BasicTokenizer()
tokenizer.train("aaabdaaabac", 256 + 3, verbose=True)
# merge 1/3: (97, 97) -> 256 (b'aa') had 4 occurrences
# merge 2/3: (256, 97) -> 257 (b'aaa') had 2 occurrences
# merge 3/3: (257, 98) -> 258 (b'aaab') had 2 occurrences

The test assertions were already correct ([258, 100, 258, 97, 99]), only the docstring was wrong.

Fixes #93

Test plan

  • pytest tests/test_tokenizer.py::test_wikipedia_example — both BasicTokenizer and RegexTokenizer pass

The docstring incorrectly described the merge sequence as X=ZY, Y=ab, Z=aa.
The actual BPE algorithm produces: Z=aa, Y=Za (aaa), X=Yb (aaab), because
the most frequent pair at each step is (97,97), then (256,97), then (257,98).

Added explicit merge sequence with occurrence counts for clarity.

Fixes karpathy#93
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

One problem in the annotations of test_wikipedia_example in the tests/test_tokenizer file

1 participant