Problem
Chunking uses character count (chunk_size=256), but embedding models have token limits (512 for nomic-embed-text, 256 for all-MiniLM-L6-v2, 8192 for text-embedding-3-small). The chars-to-tokens ratio varies with content: code can reach 1.2 tokens/char, so a 512-char code chunk becomes ~614 tokens and gets silently truncated at embedding time, losing the tail of the chunk.
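To make the arithmetic concrete, here is a minimal sketch of the estimate. The helper names estimated_tokens and tokens_lost are hypothetical and not part of this PR, and the 1.2 tokens/char ratio is the rough figure for code quoted above, not a measured constant.

```python
def estimated_tokens(num_chars: int, tokens_per_char: float) -> int:
    """Rough token estimate for a chunk of num_chars characters (hypothetical helper)."""
    return int(num_chars * tokens_per_char)


def tokens_lost(num_chars: int, tokens_per_char: float, model_limit: int) -> int:
    """Estimated tokens dropped if the model truncates at model_limit."""
    return max(0, estimated_tokens(num_chars, tokens_per_char) - model_limit)


# A 512-char code chunk at ~1.2 tokens/char against a 512-token limit
print(estimated_tokens(512, 1.2))   # 614
print(tokens_lost(512, 1.2, 512))   # 102
```

Under these assumptions, roughly the last 100 tokens of the chunk never reach the embedding.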
Solution
Add max_tokens_per_chunk param to create_traditional_chunks() and create_text_chunks():
- Uses existing calculate_safe_chunk_size() to auto-scale chunk_size before the SentenceSplitter runs
- Uses existing validate_chunk_token_limits() for post-chunk validation + truncation when needed
- AST chunking also auto-scales
- Revalidates chunk_overlap after scaling to prevent SentenceSplitter errors
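The flow in the list above can be sketched roughly as follows. This is an illustration only, not the code in chunking_utils.py: scale_chunk_params and truncate_to_limit are hypothetical stand-ins for calculate_safe_chunk_size() and validate_chunk_token_limits(), the 1.2 tokens/char ratio is an assumed worst case, the half-chunk overlap cap is an arbitrary choice, and whitespace splitting stands in for a real tokenizer.

```python
def scale_chunk_params(chunk_size: int, chunk_overlap: int, max_tokens: int,
                       tokens_per_char: float = 1.2) -> tuple:
    """Shrink chunk_size to fit max_tokens, then clamp the overlap to match.

    Hypothetical helper; the ratio and the half-chunk cap are assumptions.
    """
    # Largest character count that should stay within the token limit
    safe_size = max(1, min(chunk_size, int(max_tokens / tokens_per_char)))
    # Recheck the overlap so it stays well below the (possibly smaller) chunk size
    safe_overlap = min(chunk_overlap, safe_size // 2)
    return safe_size, safe_overlap


def truncate_to_limit(chunk: str, max_tokens: int) -> tuple:
    """Post-chunk check: return (text, was_truncated).

    Whitespace splitting is a stand-in for the model's tokenizer.
    """
    tokens = chunk.split()
    if len(tokens) <= max_tokens:
        return chunk, False
    return " ".join(tokens[:max_tokens]), True


print(scale_chunk_params(512, 128, 512))      # (426, 128)
print(scale_chunk_params(256, 128, 256))      # (213, 106)
print(truncate_to_limit("a b c d e", 3))      # ('a b c', True)
```

Scaling before splitting keeps most chunks under the limit, and the post-chunk check only has to catch the outliers where the assumed ratio was too optimistic.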
Usage
from leann.embedding_compute import get_model_token_limit
# Resolve model limit, pass to chunking
limit = get_model_token_limit("nomic-embed-text")
chunks = create_text_chunks(docs, chunk_size=256, max_tokens_per_chunk=limit)
RAG apps can use BaseRAGExample._resolve_chunk_token_limit(args) to auto-resolve from CLI args.
Changes (2 files, +113/-14)
chunking_utils.py: add max_tokens_per_chunk param, auto-scale, post-validate
base_rag_example.py: helper method + build-time warning