feat(chunking_utils): token-aware chunking to prevent silent data loss from oversized chunks (#374)

## Problem

Chunking sizes chunks by character count (`chunk_size=256`), but embedding models enforce **token limits** (512 for `nomic-embed-text`, 256 for `all-MiniLM-L6-v2`, 8192 for `text-embedding-3-small`). The character-to-token ratio is not constant: code can reach about 1.2 tokens per character, so a 512-char code chunk becomes ~614 tokens and gets **silently truncated** at embedding time, losing the tail of the chunk.
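To make the arithmetic concrete, here is a minimal sketch that estimates how many tokens a chunk would lose to truncation. The helper names (`estimate_tokens`, `tokens_lost`) and the `MODEL_TOKEN_LIMITS` table are illustrative assumptions, not part of the codebase. The limits are the ones quoted above, and the 1.2 tokens/char ratio is the rough figure for code, not a measured value.

```python
# Token limits as quoted in this issue (illustrative table, not the library's).
MODEL_TOKEN_LIMITS = {
    "nomic-embed-text": 512,
    "all-MiniLM-L6-v2": 256,
    "text-embedding-3-small": 8192,
}


def estimate_tokens(num_chars: int, tokens_per_char: float) -> int:
    """Rough token estimate from a character count and an assumed ratio."""
    return round(num_chars * tokens_per_char)


def tokens_lost(num_chars: int, tokens_per_char: float, limit: int) -> int:
    """Estimated tokens dropped if the embedder truncates at `limit`."""
    return max(0, estimate_tokens(num_chars, tokens_per_char) - limit)


# A 512-char code chunk at ~1.2 tokens/char, embedded with a 512-token limit.
estimated = estimate_tokens(512, 1.2)
lost = tokens_lost(512, 1.2, MODEL_TOKEN_LIMITS["nomic-embed-text"])
print(f"estimated={estimated} tokens, lost={lost} tokens")
```

With these assumptions, roughly a sixth of the chunk never reaches the embedding, and nothing in the pipeline reports it.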

## Solution

Add `max_tokens_per_chunk` param to `create_traditional_chunks()` and `create_text_chunks()`:

- Uses existing `calculate_safe_chunk_size()` to auto-scale `chunk_size` before the SentenceSplitter runs
- Uses existing `validate_chunk_token_limits()` for post-chunk validation + truncation when needed
- AST chunking also auto-scales
- Revalidates `chunk_overlap` after scaling to prevent SentenceSplitter errors
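The steps above can be sketched roughly as follows. This is not the actual code of `calculate_safe_chunk_size()` or `validate_chunk_token_limits()`. The functions `scale_chunk_size`, `clamp_overlap`, and `truncate_to_limit` are hypothetical stand-ins, the 0.8 safety margin and 1.2 tokens-per-unit ratio are assumptions, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Callable, List, Tuple


def scale_chunk_size(chunk_size: int, max_tokens: int,
                     tokens_per_unit: float = 1.2, safety: float = 0.8) -> int:
    """Shrink chunk_size so the estimated token count stays under max_tokens."""
    budget = int(max_tokens * safety / tokens_per_unit)
    return max(1, min(chunk_size, budget))


def clamp_overlap(chunk_overlap: int, chunk_size: int) -> int:
    """Re-check overlap after scaling; assumed rule: at most half the chunk."""
    return max(0, min(chunk_overlap, chunk_size // 2))


def truncate_to_limit(chunks: List[str], max_tokens: int,
                      tokenize: Callable[[str], List[str]] = str.split
                      ) -> Tuple[List[str], int]:
    """Post-chunk validation: truncate oversized chunks and count them.

    Rejoining with single spaces normalizes whitespace in truncated chunks,
    which is acceptable for a sketch but not for code-preserving chunking.
    """
    result, truncated = [], 0
    for chunk in chunks:
        tokens = tokenize(chunk)
        if len(tokens) > max_tokens:
            chunk = " ".join(tokens[:max_tokens])
            truncated += 1
        result.append(chunk)
    return result, truncated


size = scale_chunk_size(512, max_tokens=256)
overlap = clamp_overlap(200, size)
safe_chunks, n_truncated = truncate_to_limit(["a b c d e", "x y"], max_tokens=3)
print(size, overlap, safe_chunks, n_truncated)
```

Scaling first means truncation should rarely trigger. The post-check is a backstop for inputs whose token density exceeds the assumed ratio, and counting truncated chunks makes any remaining loss visible instead of silent.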

## Usage

```python
from leann.embedding_compute import get_model_token_limit

# Resolve model limit, pass to chunking
limit = get_model_token_limit("nomic-embed-text")
chunks = create_text_chunks(docs, chunk_size=256, max_tokens_per_chunk=limit)
```

RAG apps can use `BaseRAGExample._resolve_chunk_token_limit(args)` to auto-resolve from CLI args.
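As an illustration of what resolving the limit from CLI args might look like, here is a standalone sketch. It does not reproduce `BaseRAGExample._resolve_chunk_token_limit`. The flag names `--embedding-model` and `--max-tokens-per-chunk`, the `KNOWN_LIMITS` table, and the fallback of 512 are all assumptions made for the example.

```python
import argparse

# Limits as quoted in this issue; a real app would ask the embedding layer.
KNOWN_LIMITS = {
    "nomic-embed-text": 512,
    "all-MiniLM-L6-v2": 256,
    "text-embedding-3-small": 8192,
}


def resolve_chunk_token_limit(args: argparse.Namespace, default: int = 512) -> int:
    """An explicit flag wins; otherwise look up the model, else fall back."""
    if args.max_tokens_per_chunk is not None:
        return args.max_tokens_per_chunk
    return KNOWN_LIMITS.get(args.embedding_model, default)


# Hypothetical flags, defined here only so the sketch runs on its own.
parser = argparse.ArgumentParser()
parser.add_argument("--embedding-model", default="nomic-embed-text")
parser.add_argument("--max-tokens-per-chunk", type=int, default=None)

args = parser.parse_args(["--embedding-model", "all-MiniLM-L6-v2"])
print(resolve_chunk_token_limit(args))
```

Letting an explicit flag override the lookup keeps the helper usable with models that are missing from the table.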

## Changes (2 files, +113/-14)

- `chunking_utils.py`: add `max_tokens_per_chunk` param, auto-scale, post-validate
- `base_rag_example.py`: helper method + build-time warning
