docs(extraction): sync post-#2179 extraction doc fixes to main #2203

Open
kheiss-uwzoo wants to merge 7 commits into NVIDIA:main from kheiss-uwzoo:docs/backport-2194-extraction-docs-fix

kheiss-uwzoo (Collaborator) commented Jun 2, 2026

Summary

Sync the extraction doc structure on main with the post-#2179 review fixes that landed on 26.05 in PR #2194 but never reached main. This is not a literal cherry-pick of #2194: review feedback on this PR changed the approach.

What changed

  • Remove custom-metadata.md — consolidate metadata/filtering guidance into vdbs.md#metadata-and-filtering; add mkdocs.yml redirect; update cross-links in workflow-agentic-retrieval.md, integrations-langchain-llamaindex-haystack.md, and the doc-snippet test list.
  • Fix stale caption anchors — retarget FAQ and multimodal-extraction.md links from prerequisites-support-matrix.md#image-captioning-2605 to in-page anchors; rename the support-matrix section to #image-captioning-nim-hardware with a legacy #image-captioning-2605 span for external bookmarks; fix deployment-options.md to use the new anchor.
  • Trim support-matrix admonition — remove the chart-caption admonition block per review; add one sentence of caption-scope prose under #image-captioning in multimodal-extraction.md.
  • Other fixes from #2194 ("fix audio-video.md markdown rendering (follow-up to #2179)"): FAQ Docker Compose note; simplified OCR deploy prose in multimodal-extraction.md; explicit section anchor IDs in prerequisites-support-matrix.md.

Out of scope (already on main)

Follow-up (eng, not this PR)

Reviewer checklist

  • Delete custom-metadata.md; link to metadata filtering notebook from vdbs.md
  • Remove chart-caption admonition from support matrix and multimodal-extraction
  • Drop 26.05 from image-captioning heading; keep legacy anchor alias
  • Rebase/merge main; conflicts resolved

Test plan

  • MkDocs build passes
  • extraction/custom-metadata.md redirects to vdbs.md#metadata-and-filtering
  • FAQ chart-caption answer links resolve to multimodal-extraction.md#charts-and-infographics and #image-captioning
  • deployment-options.md offline caption link targets #image-captioning-nim-hardware
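
The anchor checks in this test plan could be partly scripted. The following is a minimal sketch, not part of the repo's test suite: it assumes anchors are declared either with the `{ #id }` attribute syntax on headings or as `<span id="...">` elements, as in the snippets quoted in this PR. The function name `explicit_anchor_ids` and the sample text are hypothetical.

```python
import re

# Heading attribute IDs, e.g.: ### Image captioning { #image-captioning-nim-hardware }
ATTR_ID = re.compile(r"^#{1,6}\s.*\{\s*#([A-Za-z0-9_-]+)\s*\}\s*$", re.MULTILINE)
# Inline HTML IDs, e.g.: <span id="image-captioning-2605"></span>
SPAN_ID = re.compile(r'<span\s+id="([^"]+)"')


def explicit_anchor_ids(markdown_text: str) -> set[str]:
    """Collect explicitly declared anchor IDs from Markdown source (hypothetical helper)."""
    return set(ATTR_ID.findall(markdown_text)) | set(SPAN_ID.findall(markdown_text))


# Sample text modeled on the support-matrix snippet quoted in this PR.
sample = """### Image captioning { #image-captioning-nim-hardware }

Some prose.
<span id="image-captioning-2605"></span>
"""

anchors = explicit_anchor_ids(sample)
expected = {"image-captioning-nim-hardware", "image-captioning-2605"}
missing = expected - anchors
print(sorted(missing))  # an empty list means both anchors are declared
```

Auto-generated heading slugs (for example `#image-captioning` in multimodal-extraction.md) are outside the scope of this sketch; only explicitly declared IDs are detected.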

kheiss-uwzoo requested review from a team as code owners June 2, 2026 22:44
kheiss-uwzoo requested a review from jperez999 June 2, 2026 22:44
kheiss-uwzoo changed the title from "docs(extraction): backport PR #2194 extraction doc fixes to main" to "backport PR #2194 extraction doc fixes to main" on Jun 2, 2026
greptile-apps Bot (Contributor) commented Jun 2, 2026

Greptile Summary

This PR syncs post-#2179 extraction doc fixes from the 26.05 branch back to main, consolidating custom-metadata.md into vdbs.md and repairing stale anchor links throughout the extraction doc set.

  • Deleted custom-metadata.md and absorbed its guidance into vdbs.md#metadata-and-filtering; added an MkDocs redirect and updated all six cross-link sites (integrations, agentic retrieval, related-topics, etc.).
  • Fixed stale #image-captioning-2605 anchors in faq.md, multimodal-extraction.md, and deployment-options.md; renamed the support-matrix section to #image-captioning-nim-hardware and left a legacy <span id="image-captioning-2605"> alias for external bookmarks.
  • Trimmed the chart-caption admonition from prerequisites-support-matrix.md per review feedback and added a concise scope sentence under the image-captioning section in multimodal-extraction.md; simplified OCR deploy prose and added a Docker Compose note to the FAQ.

Confidence Score: 5/5

Safe to merge — purely a documentation restructuring with no runtime code changes.

Every change is confined to Markdown source files, mkdocs.yml, and one test-list update. The anchor renames are complete across all referencing files, the MkDocs redirect follows the same fragment-redirect pattern already used in the project (e.g. extraction/nemoretriever-parse.md → multimodal-extraction.md#text-and-layout-extraction), and the legacy span alias preserves external bookmarks. The doc-snippet test file is correctly updated to drop the deleted file. No logic, API surface, or configuration is touched.
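
The redirect described above can be sanity-checked offline. The sketch below is only an illustration under stated assumptions: the redirect map is written as a plain Python dict rather than read from mkdocs.yml, target paths are assumed to be relative to the docs directory, and `check_redirects` is a hypothetical helper, not something in the repository.

```python
import tempfile
from pathlib import Path


def check_redirects(redirects: dict[str, str], docs_dir: Path) -> list[str]:
    """Report redirect entries whose source still exists or whose target file is missing."""
    problems = []
    for source, target in redirects.items():
        target_file = target.split("#", 1)[0]  # drop the fragment before the file check
        if (docs_dir / source).exists():
            problems.append(f"{source}: redirect source still exists")
        if not (docs_dir / target_file).exists():
            problems.append(f"{source}: target {target_file} not found")
    return problems


# Simulated docs tree: custom-metadata.md deleted, vdbs.md present.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "extraction").mkdir()
    (docs / "extraction" / "vdbs.md").write_text("## Metadata and filtering\n")
    # Assumed shape of the redirect entry; the real mkdocs.yml key layout may differ.
    redirects = {"extraction/custom-metadata.md": "extraction/vdbs.md#metadata-and-filtering"}
    problems = check_redirects(redirects, docs)
    print(problems)  # an empty list means the entry is consistent with the tree
```

This checks files only; whether the `#metadata-and-filtering` fragment resolves still depends on the heading slug produced by the MkDocs build.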

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/docs/extraction/custom-metadata.md | File deleted; content consolidated into vdbs.md#metadata-and-filtering; redirect added in mkdocs.yml; cross-links updated in three other docs |
| docs/docs/extraction/vdbs.md | Metadata-and-filtering section expanded to absorb custom-metadata.md content; two notebook links added (one pre-existing, one new); old custom-metadata.md cross-links removed from More information and Related Topics sections |
| docs/docs/extraction/prerequisites-support-matrix.md | Explicit section anchor IDs added to three headings; image-captioning section renamed from #image-captioning-2605 to #image-captioning-nim-hardware with legacy span alias; chart-caption admonition removed and scope prose moved to multimodal-extraction.md |
| docs/docs/extraction/multimodal-extraction.md | Charts-and-infographics and image-captioning sections retargeted from stale #image-captioning-2605 anchor to in-page or new anchors; brief scope prose added under image-captioning; OCR deploy paragraph simplified |
| docs/docs/extraction/faq.md | Stale #image-captioning-2605 link replaced with two in-doc anchors in multimodal-extraction.md; Docker Compose note added to authentication FAQ answer |
| docs/docs/extraction/deployment-options.md | One-line anchor update: #image-captioning-2605 → #image-captioning-nim-hardware for offline image captioning link |
| docs/mkdocs.yml | Custom metadata nav section removed (sections 8–13 renumbered to 7–12); extraction/custom-metadata.md redirect to vdbs.md#metadata-and-filtering added alongside existing fragment-redirect pattern |
| nemo_retriever/tests/test_src_documentation_snippets.py | Deleted custom-metadata.md correctly removed from the doc-snippet test file list |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["custom-metadata.md\n(deleted)"] -->|"redirect\nextraction/custom-metadata.md\n→ vdbs.md#metadata-and-filtering"| B["vdbs.md\n#metadata-and-filtering\n(expanded)"]

    C["faq.md"] -->|"old: prerequisites-support-matrix.md\n#image-captioning-2605"| D_old["❌ stale anchor"]
    C -->|"new: multimodal-extraction.md\n#charts-and-infographics\n#image-captioning"| E["multimodal-extraction.md\n(caption scope prose added)"]

    F["deployment-options.md"] -->|"old: #image-captioning-2605"| D_old
    F -->|"new: #image-captioning-nim-hardware"| G

    G["prerequisites-support-matrix.md\n### Image captioning\n{ #image-captioning-nim-hardware }\n+ legacy span #image-captioning-2605"]

    H["integrations-langchain-llamaindex-haystack.md"] -->|"old: custom-metadata.md"| A
    H -->|"new: vdbs.md#metadata-and-filtering"| B

    I["workflow-agentic-retrieval.md"] -->|"old: custom-metadata.md"| A
    I -->|"new: vdbs.md#metadata-and-filtering"| B

Reviews (13): Last reviewed commit: "Revert "docs: drop 26.05 labels from min..."

kheiss-uwzoo force-pushed the docs/backport-2194-extraction-docs-fix branch from 477f48a to 9a24d31 on June 2, 2026 22:55
kheiss-uwzoo changed the title from "backport PR #2194 extraction doc fixes to main" to "docs(extraction): backport PR #2194 extraction doc fixes to main" on Jun 2, 2026
kheiss-uwzoo force-pushed the docs/backport-2194-extraction-docs-fix branch from 9a24d31 to 019547d on June 2, 2026 23:01
Comment thread docs/docs/extraction/custom-metadata.md Outdated
Comment thread docs/docs/extraction/multimodal-extraction.md Outdated
kheiss-uwzoo added a commit to kheiss-uwzoo/nv-ingest that referenced this pull request Jun 5, 2026
…a page, remove chart admonition

Remove custom-metadata.md in favor of vdbs.md#metadata-and-filtering and the metadata filtering notebook. Drop the PDF chart caption admonition from multimodal-extraction.md per review feedback.
kheiss-uwzoo requested a review from randerzander June 5, 2026 17:54
kheiss-uwzoo changed the title from "docs(extraction): backport PR #2194 extraction doc fixes to main" to "backport PR #2194 extraction doc fixes to main" on Jun 5, 2026
kheiss-uwzoo self-assigned this Jun 8, 2026
kheiss-uwzoo added the doc (Improvements or additions to documentation) label Jun 8, 2026
Comment thread docs/docs/extraction/multimodal-extraction.md Outdated
kheiss-uwzoo requested a review from jperez999 June 10, 2026 20:13
PR NVIDIA#2194 merged into 26.05 on 2026-06-02 but never reached main. This
backport keeps main aligned with the release branch and the published
docs.nvidia.com site after Randy's follow-up review.

Timeline:
- Friday: 26.05 docs built for docs.nvidia upload; branch differed from
  NRL GitHub Pages source and the uploaded docs were incorrect.
- Saturday: diff main vs 26.05 produced PR NVIDIA#2179 to sync extraction docs.
- Monday: PR NVIDIA#2179 merged and docs uploaded to the public site.
- Follow-up: Randy opened PR NVIDIA#2194 on 26.05 with additional fixes found
  after the NVIDIA#2179 sync. Those fixes landed on 26.05 only.
- This commit: cherry-pick of c5b257e onto main (five extraction doc
  files only).

Changes from NVIDIA#2194:
- Fix audio-video.md indented code block rendering
- Restore custom-metadata example service variables and storage prose
- Move caption scope admonition to multimodal-extraction.md
- Trim redundant Helm/OCR deploy detail per review feedback
- Restore FAQ Docker Compose note and support-matrix section anchors

…a page, remove chart admonition

Remove custom-metadata.md in favor of vdbs.md#metadata-and-filtering and the metadata filtering notebook. Drop the PDF chart caption admonition from multimodal-extraction.md per review feedback.

Rename the support-matrix caption section for main and keep a legacy #image-captioning-2605 alias so existing deep links keep working.
kheiss-uwzoo force-pushed the docs/backport-2194-extraction-docs-fix branch from 6a46fb2 to 56fa45c on June 11, 2026 23:34

Resolve modify/delete conflict on custom-metadata.md by keeping the PR deletion (content consolidated in vdbs.md with redirect). Bring in main mkdocstrings path fixes and support-matrix updates.

…A#2203

Point deployment-options.md at #image-captioning-nim-hardware. Add one sentence under multimodal-extraction #image-captioning so FAQ cross-references have scope detail without restoring the admonition.
kheiss-uwzoo changed the title from "backport PR #2194 extraction doc fixes to main" to "docs(extraction): sync post-#2179 extraction doc fixes to main" on Jun 15, 2026

jperez999 (Collaborator) left a comment

Left one thing; I don't know if it is an artifact or actually what we want to reference.

### Image captioning { #image-captioning-nim-hardware }

For 26.05, use **`nemotron_3_nano_omni_30b_a3b_reasoning`** when you enable the caption stage (hosted model ID `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`). The Helm key is in the [optional NIMs](#optional-helm-nims-not-auto-wired-by-default) table above.
<span id="image-captioning-2605"></span>

Collaborator

Should this still be here? It's referring to 2605.

Rename the Helm README minimal-install section, keep a legacy
#recommended-minimal-install-2605 alias, point extraction docs at
blob/main, and use version-neutral prose in chart default notes.
- **Published guide** — [Custom metadata and filtering](custom-metadata.md) (sidecar `meta_*` on `vdb_upload`, compact JSON in LanceDB, server-side `where` on `Retriever.query`, and client-side `filter_hits_by_content_metadata`).
- **Canonical reference** — [Vector DB operators and LanceDB — Metadata filtering](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) in `nemo_retriever/src/nemo_retriever/vdb/README.md` (operator behavior and examples).
- [Metadata filtering notebook](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb) — end-to-end ingest, `Retriever.query`, and both filter modes
- [Sidecar metadata ingest](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph workflow

Contributor

P1 Dead link — notebook does not exist yet

The file examples/metadata_and_filtered_search.ipynb is not present in the repository (only examples/nemo_retriever_retriever_query_metadata_filter.ipynb exists). Anyone following this link from the published docs will hit a GitHub 404. The PR description calls this notebook out as a follow-up engineering item — the link should either be removed until the notebook lands, or replaced with the existing nemo_retriever_retriever_query_metadata_filter.ipynb which already covers end-to-end ingest and both filter modes.
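
A check of this kind can be scripted against a local checkout. The sketch below is one possible approach, not an existing tool in the repo: it assumes links use the `https://github.com/NVIDIA/NeMo-Retriever/blob/main/<path>` form quoted above, and `missing_blob_targets` is a hypothetical helper. The temporary directory stands in for a checkout in the state this comment describes.

```python
import re
import tempfile
from pathlib import Path

# Capture the repo-relative path from blob/main links (stops at ')', whitespace, or '#').
BLOB_LINK = re.compile(r"https://github\.com/NVIDIA/NeMo-Retriever/blob/main/([^)\s#]+)")


def missing_blob_targets(markdown_text: str, repo_root: Path) -> list[str]:
    """Return linked repo paths that do not exist under repo_root (hypothetical helper)."""
    paths = BLOB_LINK.findall(markdown_text)
    return [p for p in paths if not (repo_root / p).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "examples").mkdir()
    # Only the existing notebook is present in the simulated checkout.
    (root / "examples" / "nemo_retriever_retriever_query_metadata_filter.ipynb").write_text("{}")
    text = (
        "- [Metadata filtering notebook](https://github.com/NVIDIA/NeMo-Retriever/blob/main/"
        "examples/nemo_retriever_retriever_query_metadata_filter.ipynb)\n"
        "- [Sidecar metadata ingest](https://github.com/NVIDIA/NeMo-Retriever/blob/main/"
        "examples/metadata_and_filtered_search.ipynb)\n"
    )
    dead = missing_blob_targets(text, root)
    print(dead)  # ['examples/metadata_and_filtered_search.ipynb']
```

Checking against the working tree avoids network requests, but it only reflects the branch that is checked out, not what `blob/main` serves at publish time.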

Path: docs/docs/extraction/vdbs.md, line 97

Labels

doc Improvements or additions to documentation

3 participants