
fix(rag): split oversized markdown header chunks #3085

Open
wolfkill wants to merge 1 commit into eosphoros-ai:main from wolfkill:fix/markdown-oversize-code-chunks


@wolfkill


Summary

Why

Markdown documents can contain large fenced YAML/TOML/config blocks under a single header. The header splitter previously returned that whole section as one chunk, which can exceed embedding provider input limits even when chunk_size is configured.
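The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual DB-GPT splitter: `split_oversized_section` and `chunk_sections` are hypothetical helpers, and the `Header1` metadata key is an assumption made for illustration. The idea is only that a section larger than `chunk_size` is cut further, preferably at line breaks, while every resulting piece keeps the header metadata of its section.

```python
from typing import Dict, List, Tuple


def split_oversized_section(text: str, chunk_size: int) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    Hypothetical helper: prefers line boundaries and hard-cuts only
    when a single line is itself longer than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than chunk_size is cut by character count.
        while len(line) > chunk_size:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(line[:chunk_size])
            line = line[chunk_size:]
        # Start a new piece when the next line would overflow the current one.
        if len(current) + len(line) > chunk_size:
            pieces.append(current)
            current = ""
        current += line
    if current:
        pieces.append(current)
    return pieces


def chunk_sections(
    sections: List[Tuple[str, str]], chunk_size: int
) -> List[Dict[str, object]]:
    """Turn (header, body) pairs into size-bounded chunks with header metadata."""
    chunks: List[Dict[str, object]] = []
    for header, body in sections:
        for piece in split_oversized_section(body, chunk_size):
            # "Header1" is an assumed metadata key, used here for illustration.
            chunks.append({"content": piece, "metadata": {"Header1": header}})
    return chunks


if __name__ == "__main__":
    # One header followed by a large fenced YAML block.
    body = "```yaml\n" + "".join(f"key_{i}: value_{i}\n" for i in range(50)) + "```\n"
    for chunk in chunk_sections([("Config", body)], chunk_size=128):
        print(len(chunk["content"]), chunk["metadata"])
```

In this sketch no overlap is applied, so concatenating the pieces reproduces the original section exactly; a real splitter would typically also honor `chunk_overlap`.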

Tests

  • Red test before implementation:
    • .venv/bin/python -m pytest packages/dbgpt-core/src/dbgpt/rag/text_splitter/tests/test_splitters.py::test_md_header_text_splitter_splits_oversized_code_block -q
    • Failed as expected because only one oversized chunk was produced.
  • .venv/bin/python -m pytest packages/dbgpt-core/src/dbgpt/rag/text_splitter/tests/test_splitters.py::test_md_header_text_splitter_splits_oversized_code_block -q
  • .venv/bin/python -m pytest packages/dbgpt-core/src/dbgpt/rag/text_splitter/tests/test_splitters.py packages/dbgpt-ext/src/dbgpt_ext/rag/knowledge/tests/test_markdown.py -q
  • Manual default-path check with MarkdownKnowledge + ChunkManager + ChunkParameters(chunk_strategy="Automatic", chunk_size=128, chunk_overlap=16):
    • Result: 17 chunks, max length 127, header/source metadata preserved.
  • .venv/bin/python -m ruff check packages/dbgpt-core/src/dbgpt/rag/text_splitter/text_splitter.py packages/dbgpt-core/src/dbgpt/rag/text_splitter/tests/test_splitters.py
  • .venv/bin/python -m ruff format --check packages/dbgpt-core/src/dbgpt/rag/text_splitter/text_splitter.py packages/dbgpt-core/src/dbgpt/rag/text_splitter/tests/test_splitters.py
  • .venv/bin/python -m compileall -q packages/dbgpt-core/src/dbgpt/rag/text_splitter/text_splitter.py packages/dbgpt-core/src/dbgpt/rag/text_splitter/tests/test_splitters.py
  • git diff --check

@github-actions (bot) added the fix (Bug fixes) label on May 25, 2026
@wolfkill force-pushed the fix/markdown-oversize-code-chunks branch from babd3d6 to c181185 on May 26, 2026 02:43
@wolfkill


Follow-up update:

  • Rebased this branch onto the latest origin/main after #3033 (fix(rag): use size chunking by default for markdown knowledge, fixes #3030) made size chunking the default for markdown.
  • Resolved the splitter conflict by keeping the upstream fallback split behavior and only clamping the fallback overlap to avoid introducing a chunk_overlap > chunk_size error for direct MarkdownHeaderTextSplitter(chunk_size=...) usage.
  • Also formatted react_parser.py and test_react_parser.py because the repository CI runs full-package make fmt-check, and those files were the current format blocker.
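The overlap clamp mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `clamp_overlap` and `sliding_chunks` are hypothetical names, and the sketch clamps the overlap to strictly less than `chunk_size` so that the window always advances, whereas the actual change may use a different bound.

```python
from typing import List


def clamp_overlap(chunk_size: int, chunk_overlap: int) -> int:
    """Return an overlap that is non-negative and smaller than chunk_size.

    Hypothetical helper: prevents a configured or default overlap from
    exceeding a small chunk_size passed directly to a splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return max(0, min(chunk_overlap, chunk_size - 1))


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size that overlap by the clamped amount."""
    overlap = clamp_overlap(chunk_size, chunk_overlap)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    # An overlap larger than the chunk size is clamped instead of raising.
    print(clamp_overlap(64, 200))
    print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Clamping rather than raising keeps direct callers that only pass `chunk_size` working, at the cost of silently adjusting an overlap the caller may have set explicitly.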

Local verification:

  • PYTHONPATH=packages/dbgpt-core/src .venv/bin/python -m pytest packages/dbgpt-core/src/dbgpt/rag/text_splitter/tests/test_splitters.py -q -> 5 passed
  • PYTHONPATH=packages/dbgpt-core/src .venv/bin/python -m pytest packages/dbgpt-core/src/dbgpt/agent/util/tests/test_react_parser.py -q -> 24 passed
  • make fmt-check -> passed
  • PYTHONPATH=packages/dbgpt-core/src .venv/bin/python -m compileall -q packages/dbgpt-core/src/dbgpt/rag/text_splitter/text_splitter.py packages/dbgpt-core/src/dbgpt/rag/text_splitter/tests/test_splitters.py packages/dbgpt-core/src/dbgpt/agent/util/react_parser.py packages/dbgpt-core/src/dbgpt/agent/util/tests/test_react_parser.py -> passed
  • git diff --check -> passed

GitHub CI is now green on this PR.



Development

Successfully merging this pull request may close these issues.

[Bug] RAG markdown default chunk strategy fails on large embedded YAML/TOML code blocks (single oversize chunk)
