feat(chunking_utils): token-aware chunking — prevent silent data loss from oversized chunks #374

Description

@luojiyin1987

Problem

Chunking is sized by character count (chunk_size=256), but embedding models enforce token limits (512 for nomic-embed-text, 256 for all-MiniLM-L6-v2, 8192 for text-embedding-3-small). The character-to-token ratio is not constant: code can reach 1.2 tokens/char, so a 512-char code chunk becomes ~614 tokens. Against a 512-token limit, the excess is silently truncated at embedding time and the tail of the chunk is lost.
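The arithmetic above can be checked with a short sketch. The helper names `estimated_tokens` and `tokens_lost` are hypothetical and not part of the codebase, and the 1.2 tokens/char ratio is the figure quoted in this issue, not a measured value:

```python
def estimated_tokens(num_chars: int, tokens_per_char: float) -> int:
    """Rough token estimate for a chunk, given an assumed tokens-per-char ratio."""
    return round(num_chars * tokens_per_char)


def tokens_lost(num_chars: int, tokens_per_char: float, model_limit: int) -> int:
    """Tokens that would be cut off if the model truncates at model_limit."""
    return max(0, estimated_tokens(num_chars, tokens_per_char) - model_limit)


# A 512-char code chunk at 1.2 tokens/char, embedded by a model with a 512-token limit
print(estimated_tokens(512, 1.2))   # estimated token count
print(tokens_lost(512, 1.2, 512))   # tokens silently dropped from the tail
```

With these assumed numbers, roughly 100 tokens at the end of every such chunk never reach the embedding.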

Solution

Add max_tokens_per_chunk param to create_traditional_chunks() and create_text_chunks():

  • Uses existing calculate_safe_chunk_size() to auto-scale chunk_size before the SentenceSplitter runs
  • Uses existing validate_chunk_token_limits() for post-chunk validation + truncation when needed
  • AST chunking also auto-scales
  • Revalidates chunk_overlap after scaling to prevent SentenceSplitter errors
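The auto-scale and overlap revalidation steps listed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual `calculate_safe_chunk_size()` logic: `scale_chunk_size` and `clamp_overlap` are hypothetical names, and the safety margin and the half-chunk overlap cap are illustrative choices.

```python
import math


def scale_chunk_size(chunk_size: int, max_tokens: int,
                     tokens_per_char: float = 1.2, safety: float = 0.9) -> int:
    """Shrink a character-based chunk_size so its estimated token count
    stays under max_tokens. The ratio and safety margin are assumptions."""
    budget_chars = math.floor(max_tokens * safety / tokens_per_char)
    return max(1, min(chunk_size, budget_chars))


def clamp_overlap(chunk_overlap: int, chunk_size: int) -> int:
    """Keep the overlap well below the (possibly reduced) chunk size.
    Capping at half the chunk size is an illustrative policy."""
    return max(0, min(chunk_overlap, chunk_size // 2))


# Scale first, then revalidate the overlap against the new size
size = scale_chunk_size(512, max_tokens=512)
overlap = clamp_overlap(128, size)
print(size, overlap)
```

Revalidating the overlap after scaling matters because an overlap that was valid for the original chunk size can exceed the reduced one, which a splitter would reject.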

Usage

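from leann.embedding_compute import get_model_token_limit
# Assumed import path; this issue only names the file chunking_utils.py
from leann.chunking_utils import create_text_chunks

# Resolve the model's token limit and pass it to chunking
# (docs is the list of documents already loaded by the caller)
limit = get_model_token_limit("nomic-embed-text")
chunks = create_text_chunks(docs, chunk_size=256, max_tokens_per_chunk=limit)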

RAG apps can use BaseRAGExample._resolve_chunk_token_limit(args) to auto-resolve from CLI args.
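To illustrate what resolving a limit involves, here is a minimal sketch assuming a plain lookup table filled with the limits quoted in the Problem section. `resolve_token_limit`, `MODEL_TOKEN_LIMITS`, and `DEFAULT_TOKEN_LIMIT` are hypothetical and do not describe how `get_model_token_limit` or `_resolve_chunk_token_limit` work internally.

```python
# Limits as quoted in this issue; the table itself is an illustrative assumption
MODEL_TOKEN_LIMITS = {
    "nomic-embed-text": 512,
    "all-MiniLM-L6-v2": 256,
    "text-embedding-3-small": 8192,
}

# Conservative fallback for unknown models: the smallest known limit
DEFAULT_TOKEN_LIMIT = min(MODEL_TOKEN_LIMITS.values())


def resolve_token_limit(model_name: str) -> int:
    """Return the token limit for a model, falling back to a conservative default."""
    return MODEL_TOKEN_LIMITS.get(model_name, DEFAULT_TOKEN_LIMIT)


print(resolve_token_limit("nomic-embed-text"))
print(resolve_token_limit("some-unknown-model"))
```

Falling back to the smallest known limit errs on the side of smaller chunks instead of silent truncation.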

Changes (2 files, +113/-14)

  • chunking_utils.py: add max_tokens_per_chunk param, auto-scale, post-validate
  • base_rag_example.py: helper method + build-time warning
