docs(extraction): sync post-#2179 extraction doc fixes to main #2203
kheiss-uwzoo wants to merge 7 commits into
Greptile Summary
This PR syncs post-#2179 extraction doc fixes from the
| Filename | Overview |
|---|---|
| docs/docs/extraction/custom-metadata.md | File deleted — content consolidated into vdbs.md#metadata-and-filtering; redirect added in mkdocs.yml; cross-links updated in three other docs |
| docs/docs/extraction/vdbs.md | Metadata-and-filtering section expanded to absorb custom-metadata.md content; two notebook links added (one pre-existing, one new); old custom-metadata.md cross-links removed from More information and Related Topics sections |
| docs/docs/extraction/prerequisites-support-matrix.md | Explicit section anchor IDs added to three headings; image-captioning section renamed from #image-captioning-2605 to #image-captioning-nim-hardware with legacy span alias; chart-caption admonition removed and scope prose moved to multimodal-extraction.md |
| docs/docs/extraction/multimodal-extraction.md | Charts-and-infographics and image-captioning sections retargeted from stale #image-captioning-2605 anchor to in-page or new anchors; brief scope prose added under image-captioning; OCR deploy paragraph simplified |
| docs/docs/extraction/faq.md | Stale #image-captioning-2605 link replaced with two in-doc anchors in multimodal-extraction.md; Docker Compose note added to authentication FAQ answer |
| docs/docs/extraction/deployment-options.md | One-line anchor update: #image-captioning-2605 → #image-captioning-nim-hardware for offline image captioning link |
| docs/mkdocs.yml | Custom metadata nav section removed (sections 8–13 renumbered to 7–12); extraction/custom-metadata.md redirect to vdbs.md#metadata-and-filtering added alongside existing fragment-redirect pattern |
| nemo_retriever/tests/test_src_documentation_snippets.py | Deleted custom-metadata.md correctly removed from the doc-snippet test file list |
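The link retargeting summarized in the table can be sketched as a small rewrite pass. This is only an illustration, not how the PR was produced: the helper name `retarget_links` and the `RETARGETS` mapping are hypothetical, and the exact link strings are assumptions read off the overview above.

```python
import re

# Old link target -> new link target. The pairs are assumptions taken from
# the file overview; the real docs may spell the relative paths differently.
RETARGETS = {
    "custom-metadata.md": "vdbs.md#metadata-and-filtering",
    "prerequisites-support-matrix.md#image-captioning-2605":
        "prerequisites-support-matrix.md#image-captioning-nim-hardware",
}

# Matches the "](target)" tail of an inline markdown link.
LINK_RE = re.compile(r"\]\(([^)\s]+)\)")


def retarget_links(markdown: str) -> str:
    """Rewrite known stale link targets; leave every other link untouched."""
    def swap(match: re.Match) -> str:
        target = match.group(1)
        return "](" + RETARGETS.get(target, target) + ")"

    return LINK_RE.sub(swap, markdown)


before = "See [Custom metadata](custom-metadata.md) and [FAQ](faq.md)."
after = retarget_links(before)
```

Only exact matches are rewritten, so a link such as `faq.md` passes through unchanged. The `faq.md` change in this PR splits one old link into two new anchors, so a one-to-one map like this would not cover it.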
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["custom-metadata.md\n(deleted)"] -->|"redirect\nextraction/custom-metadata.md\n→ vdbs.md#metadata-and-filtering"| B["vdbs.md\n#metadata-and-filtering\n(expanded)"]
C["faq.md"] -->|"old: prerequisites-support-matrix.md\n#image-captioning-2605"| D_old["❌ stale anchor"]
C -->|"new: multimodal-extraction.md\n#charts-and-infographics\n#image-captioning"| E["multimodal-extraction.md\n(caption scope prose added)"]
F["deployment-options.md"] -->|"old: #image-captioning-2605"| D_old
F -->|"new: #image-captioning-nim-hardware"| G
G["prerequisites-support-matrix.md\n### Image captioning\n{ #image-captioning-nim-hardware }\n+ legacy span #image-captioning-2605"]
H["integrations-langchain-llamaindex-haystack.md"] -->|"old: custom-metadata.md"| A
H -->|"new: vdbs.md#metadata-and-filtering"| B
I["workflow-agentic-retrieval.md"] -->|"old: custom-metadata.md"| A
I -->|"new: vdbs.md#metadata-and-filtering"| B
Reviews (13): Last reviewed commit: "Revert "docs: drop 26.05 labels from min..."
477f48a to 9a24d31
9a24d31 to 019547d
…a page, remove chart admonition Remove custom-metadata.md in favor of vdbs.md#metadata-and-filtering and the metadata filtering notebook. Drop the PDF chart caption admonition from multimodal-extraction.md per review feedback.
PR NVIDIA#2194 merged into 26.05 on 2026-06-02 but never reached main. This backport keeps main aligned with the release branch and the published docs.nvidia.com site after Randy's follow-up review.

Timeline:
- Friday: 26.05 docs built for docs.nvidia upload; branch differed from NRL GitHub Pages source and the uploaded docs were incorrect.
- Saturday: diff main vs 26.05 produced PR NVIDIA#2179 to sync extraction docs.
- Monday: PR NVIDIA#2179 merged and docs uploaded to the public site.
- Follow-up: Randy opened PR NVIDIA#2194 on 26.05 with additional fixes found after the NVIDIA#2179 sync. Those fixes landed on 26.05 only.
- This commit: cherry-pick of c5b257e onto main (five extraction doc files only).

Changes from NVIDIA#2194:
- Fix audio-video.md indented code block rendering
- Restore custom-metadata example service variables and storage prose
- Move caption scope admonition to multimodal-extraction.md
- Trim redundant Helm/OCR deploy detail per review feedback
- Restore FAQ Docker Compose note and support-matrix section anchors
…a page, remove chart admonition Remove custom-metadata.md in favor of vdbs.md#metadata-and-filtering and the metadata filtering notebook. Drop the PDF chart caption admonition from multimodal-extraction.md per review feedback.
Rename the support-matrix caption section for main and keep a legacy #image-captioning-2605 alias so existing deep links keep working.
6a46fb2 to 56fa45c
Resolve modify/delete conflict on custom-metadata.md by keeping the PR deletion (content consolidated in vdbs.md with redirect). Bring in main mkdocstrings path fixes and support-matrix updates.
…A#2203 Point deployment-options.md at #image-captioning-nim-hardware. Add one sentence under multimodal-extraction #image-captioning so FAQ cross-references have scope detail without restoring the admonition.
jperez999 left a comment
Left one thing, don't know if it is an artifact or actually what we want to reference.
### Image captioning { #image-captioning-nim-hardware }

For 26.05, use **`nemotron_3_nano_omni_30b_a3b_reasoning`** when you enable the caption stage (hosted model ID `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`). The Helm key is in the [optional NIMs](#optional-helm-nims-not-auto-wired-by-default) table above.
<span id="image-captioning-2605"></span>
Should this still be here? It's referring to 2605?
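Whether the legacy alias is still needed depends on both IDs resolving on the rendered page. A rough way to check, assuming anchors are declared either with an attr_list-style `{ #id }` suffix on a heading or with an inline `<span id="...">`; the helper `collect_anchor_ids` is hypothetical and not part of the repository:

```python
import re

# Explicit heading ID, e.g. "### Image captioning { #image-captioning-nim-hardware }"
HEADING_ID_RE = re.compile(
    r"^#{1,6}\s.*\{\s*#([A-Za-z0-9_-]+)\s*\}\s*$", re.MULTILINE
)
# Inline alias, e.g. '<span id="image-captioning-2605"></span>'
SPAN_ID_RE = re.compile(r'<span\s+id="([^"]+)"\s*>')


def collect_anchor_ids(markdown: str) -> set[str]:
    """Return every explicitly declared anchor ID found in a markdown string."""
    ids = set(HEADING_ID_RE.findall(markdown))
    ids.update(SPAN_ID_RE.findall(markdown))
    return ids


sample = (
    "### Image captioning { #image-captioning-nim-hardware }\n"
    "\n"
    '<span id="image-captioning-2605"></span>\n'
)
anchors = collect_anchor_ids(sample)
```

With the snippet under review, both the new ID and the legacy `image-captioning-2605` ID are collected, which is what keeps existing deep links working after the rename. Auto-generated heading slugs are deliberately ignored here.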
Rename the Helm README minimal-install section, keep a legacy #recommended-minimal-install-2605 alias, point extraction docs at blob/main, and use version-neutral prose in chart default notes.
- **Published guide** — [Custom metadata and filtering](custom-metadata.md) (sidecar `meta_*` on `vdb_upload`, compact JSON in LanceDB, server-side `where` on `Retriever.query`, and client-side `filter_hits_by_content_metadata`).
- **Canonical reference** — [Vector DB operators and LanceDB — Metadata filtering](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) in `nemo_retriever/src/nemo_retriever/vdb/README.md` (operator behavior and examples).
- [Metadata filtering notebook](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb) — end-to-end ingest, `Retriever.query`, and both filter modes
- [Sidecar metadata ingest](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph workflow
Dead link — notebook does not exist yet
The file examples/metadata_and_filtered_search.ipynb is not present in the repository (only examples/nemo_retriever_retriever_query_metadata_filter.ipynb exists). Anyone following this link from the published docs will hit a GitHub 404. The PR description calls this notebook out as a follow-up engineering item — the link should either be removed until the notebook lands, or replaced with the existing nemo_retriever_retriever_query_metadata_filter.ipynb which already covers end-to-end ingest and both filter modes.
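A dead link of this kind can be caught before publishing by comparing GitHub blob URLs in the docs against a local checkout. The sketch below is an assumption-laden illustration: `missing_blob_targets` is a hypothetical helper, the URL shape `https://github.com/<org>/<repo>/blob/<branch>/<path>` is assumed, and branch names containing `/` are not handled.

```python
import re
import tempfile
from pathlib import Path

# Assumed shape: https://github.com/<org>/<repo>/blob/<branch>/<path>
BLOB_LINK_RE = re.compile(
    r"https://github\.com/[^/\s]+/[^/\s]+/blob/[^/\s]+/([^)\s#]+)"
)


def missing_blob_targets(markdown: str, repo_root: Path) -> list[str]:
    """Return repo-relative paths linked via blob URLs that are absent locally."""
    paths = BLOB_LINK_RE.findall(markdown)
    return [p for p in paths if not (repo_root / p).is_file()]


# Demo against a throwaway directory that mimics the state described above:
# only the query-metadata-filter notebook exists.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "examples").mkdir()
    existing = "nemo_retriever_retriever_query_metadata_filter.ipynb"
    (root / "examples" / existing).write_text("{}")
    doc = (
        "- [Metadata filtering notebook](https://github.com/NVIDIA/NeMo-Retriever/"
        "blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb)\n"
        "- [Sidecar metadata ingest](https://github.com/NVIDIA/NeMo-Retriever/"
        "blob/main/examples/metadata_and_filtered_search.ipynb)\n"
    )
    missing = missing_blob_targets(doc, root)
```

The check is purely local, so it says nothing about what exists on the remote branch; it only flags paths that the checked-out tree does not contain.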
This reverts commit 3f5fe71.
Summary
Sync extraction doc structure on `main` with post-#2179 review fixes that landed on `26.05` in PR #2194 but never reached `main`. This is not a literal cherry-pick of #2194; review feedback on this PR evolved the approach.
What changed
- `custom-metadata.md`: consolidate metadata/filtering guidance into `vdbs.md#metadata-and-filtering`; add `mkdocs.yml` redirect; update cross-links in `workflow-agentic-retrieval.md`, `integrations-langchain-llamaindex-haystack.md`, and the doc-snippet test list.
- `multimodal-extraction.md` links from `prerequisites-support-matrix.md#image-captioning-2605` to in-page anchors; rename the support-matrix section to `#image-captioning-nim-hardware` with a legacy `#image-captioning-2605` span for external bookmarks; fix `deployment-options.md` to use the new anchor.
- `#image-captioning` in `multimodal-extraction.md`.
- `multimodal-extraction.md`; explicit section anchor IDs in `prerequisites-support-matrix.md`.
Out of scope (already on `main`)
- `audio-video.md` markdown rendering fixes from "fix audio-video.md markdown rendering (follow-up to #2179)" #2194 are already present on `main` via the `26.05` merge and follow-up doc PRs.
Follow-up (eng, not this PR)
Reviewer checklist
- `custom-metadata.md`; link to metadata filtering notebook from `vdbs.md`
- `26.05` from image-captioning heading; keep legacy anchor alias
- `main`; conflicts resolved
Test plan
- `extraction/custom-metadata.md` redirects to `vdbs.md#metadata-and-filtering`
- `multimodal-extraction.md#charts-and-infographics` and `#image-captioning`
- `deployment-options.md` offline caption link targets `#image-captioning-nim-hardware`