I'm evaluating ViDoRAG against a simpler ColPali + VLM generation pipeline. Specifically:
- For documents that are predominantly UI screenshots rather than text-rich pages, does ViDoRAG's GMM-based hybrid retrieval (visual + textual) provide meaningful gains over ColPali's vision-only retrieval? Since the raw documents have minimal extractable text, the textual pipeline in the hybrid approach may not contribute much.
- Does the multi-agent iterative reasoning in ViDoRAG help with queries that require understanding spatial relationships within UI screenshots (e.g., "which field should I fill in for employee department code?")?
- What is the minimum infrastructure requirement to run ViDoRAG end-to-end? I'm considering a self-hosted deployment and want to understand GPU/mem