ViDoRAG vs ColPali+VLM for RAG on screenshot documents?

I am evaluating **ViDoRAG** against a simpler **ColPali + VLM generation** pipeline. Specifically:

1. For documents that are predominantly UI screenshots rather than text-rich pages, does ViDoRAG's GMM-based hybrid retrieval (visual + textual) provide meaningful gains over ColPali's vision-only retrieval? Since the raw documents have minimal extractable text, the textual pipeline in the hybrid approach may not contribute much.

2. Does the multi-agent iterative reasoning in ViDoRAG help with queries that require understanding spatial relationships within UI screenshots (e.g., "which field should I fill in for employee department code?")?

3. What are the minimum infrastructure requirements to run ViDoRAG end-to-end? I am considering self-hosted deployment and want to understand GPU/mem
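
To make the concern in question 1 concrete, here is a minimal sketch of one way a GMM-based adaptive cutoff can work: fit a two-component 1-D Gaussian mixture to each modality's similarity scores, keep the pages assigned to the higher-mean component, and take the union across modalities. This is a simplified illustration, not ViDoRAG's actual implementation; the function names (`fit_two_gaussians`, `adaptive_select`, `hybrid_select`) and the example scores are hypothetical.

```python
import math

def _pdf(x, mean, var):
    # Density of a 1-D Gaussian at x.
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def _responsibilities(scores, weights, means, variances):
    # E-step: probability that each score belongs to each of the two components.
    out = []
    for s in scores:
        p = [weights[k] * _pdf(s, means[k], variances[k]) for k in range(2)]
        total = sum(p)
        out.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
    return out

def fit_two_gaussians(scores, iters=50, min_var=1e-6):
    # Plain EM for a two-component mixture, initialised at the min and max score.
    lo, hi = min(scores), max(scores)
    means = [lo, hi]
    spread = max(((hi - lo) / 2) ** 2, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iters):
        resp = _responsibilities(scores, weights, means, variances)
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component has collapsed; leave its parameters alone
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, min_var)  # floor avoids division by zero
    return weights, means, variances

def adaptive_select(scored):
    # Keep the ids whose score most likely comes from the higher-mean component.
    ids = list(scored)
    scores = [scored[i] for i in ids]
    if len(scores) < 2:
        return ids
    weights, means, variances = fit_two_gaussians(scores)
    high = 0 if means[0] > means[1] else 1
    resp = _responsibilities(scores, weights, means, variances)
    return [i for i, r in zip(ids, resp) if r[high] > 0.5]

def hybrid_select(visual_scored, text_scored):
    # Union of the per-modality selections, visual candidates first.
    merged = adaptive_select(visual_scored)
    for i in adaptive_select(text_scored):
        if i not in merged:
            merged.append(i)
    return merged

# Hypothetical scores for six screenshot pages: a clear visual signal,
# and an identical (uninformative) text score for every page.
visual_scores = {"p1": 0.91, "p2": 0.88, "p3": 0.30, "p4": 0.28, "p5": 0.25, "p6": 0.27}
text_scores = {p: 0.10 for p in visual_scores}

print(hybrid_select(visual_scores, text_scores))
```

In this toy setup the text channel adds no candidates because its scores carry no separation, which is the situation question 1 asks about. With real OCR output from screenshots the text scores would be noisy rather than identical, so the textual branch could also add irrelevant pages.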

ViDoRAG vs ColPali+VLM for RAG on screenshot documents? #42

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ViDoRAG vs ColPali+VLM for RAG on screenshot documents? #42

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions