ViDoRAG vs ColPali+VLM for RAG on screenshot documents? #42

@Hert4

I'm evaluating ViDoRAG against a simpler ColPali + VLM generation pipeline and have three questions:

  1. For documents that are predominantly UI screenshots rather than text-rich pages, does ViDoRAG's GMM-based hybrid retrieval (visual + textual) provide meaningful gains over ColPali's vision-only retrieval? Because these documents have minimal extractable text, the textual branch of the hybrid approach may not contribute much.

  2. Does the multi-agent iterative reasoning in ViDoRAG help with queries that require understanding spatial relationships within UI screenshots (e.g., "which field should I fill in for employee department code?")?

  3. What is the minimum infrastructure requirement to run ViDoRAG end-to-end? I'm considering a self-hosted deployment and want to understand GPU/mem
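
To make question 1 concrete, here is a minimal sketch of how I understand GMM-based hybrid selection: fit a two-component Gaussian mixture to each modality's similarity scores, keep the pages that fall in the higher-scoring component, then merge the two sets. This is my own simplification and not ViDoRAG's actual code; `gmm_select` and `hybrid_select` are hypothetical names, and the scores in the usage note are made up for illustration.

```python
import math


def _normal_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_select(scores, n_iter=50, min_var=1e-4):
    """Fit a 2-component 1D GMM to similarity scores with EM and return the
    indices whose posterior favors the higher-mean component."""
    n = len(scores)
    if n < 2 or max(scores) == min(scores):
        return list(range(n))  # no spread, so nothing can be separated

    # Deterministic init: one component at each extreme, shared variance.
    means = [min(scores), max(scores)]
    overall = sum(scores) / n
    var0 = max(sum((s - overall) ** 2 for s in scores) / n, min_var)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    resp_hi = [0.0] * n

    for _ in range(n_iter):
        # E-step: posterior probability of the high-score component.
        for i, s in enumerate(scores):
            p_lo = weights[0] * _normal_pdf(s, means[0], variances[0])
            p_hi = weights[1] * _normal_pdf(s, means[1], variances[1])
            total = p_lo + p_hi
            resp_hi[i] = p_hi / total if total > 0 else 0.5

        # M-step: re-estimate weights, means, and (floored) variances.
        n_hi = sum(resp_hi)
        n_lo = n - n_hi
        if n_hi < 1e-9 or n_lo < 1e-9:
            break  # one component vanished; keep the last assignment
        means = [
            sum((1 - r) * s for r, s in zip(resp_hi, scores)) / n_lo,
            sum(r * s for r, s in zip(resp_hi, scores)) / n_hi,
        ]
        variances = [
            max(sum((1 - r) * (s - means[0]) ** 2 for r, s in zip(resp_hi, scores)) / n_lo, min_var),
            max(sum(r * (s - means[1]) ** 2 for r, s in zip(resp_hi, scores)) / n_hi, min_var),
        ]
        weights = [n_lo / n, n_hi / n]

    return [i for i, r in enumerate(resp_hi) if r > 0.5]


def hybrid_select(visual_scores, text_scores):
    """Union of the pages selected independently from each modality."""
    return sorted(set(gmm_select(visual_scores)) | set(gmm_select(text_scores)))
```

For example, `gmm_select([0.91, 0.88, 0.35, 0.33, 0.30, 0.31, 0.29])` keeps only the first two pages. My concern is what the textual call contributes when OCR on a screenshot yields little text: its scores may carry little signal, so the union could add noise rather than recall.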
