RAG has proven effective at enhancing the factual accuracy of LLMs by grounding their outputs in external, relevant information. However, most existing RAG implementations are limited to text-based corpora, which restricts their applicability to real-world scenarios where queries may require diverse types of information, ranging from textual definitions to spatial understanding from images or temporal reasoning from videos. While some recent approaches have extended RAG to handle other modalities such as images and videos, these systems are often constrained to operate within a single modality-specific corpus. This limits their ability to respond effectively to the wide spectrum of user queries that demand multimodal reasoning. Moreover, current RAG methods usually retrieve from all modalities without discerning which is most relevant for a given query, making the process inefficient and less adaptive to specific information needs.
To address this, recent research emphasizes the need for adaptive RAG systems that determine the appropriate modality and retrieval granularity based on the query context. Strategies include routing queries by complexity, such as deciding between no retrieval, single-step, or multi-step retrieval, and using model confidence to trigger retrieval only when needed. Furthermore, the granularity of retrieval plays a crucial role: studies have shown that indexing corpora at finer levels, such as propositions or specific video clips, can significantly improve retrieval relevance and system performance. Hence, for RAG to truly support complex, real-world information needs, it must handle multiple modalities and adapt its retrieval depth and scope to the specific demands of each query.
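To make the idea of complexity-adaptive retrieval concrete, here is a minimal sketch of how a system might choose between no retrieval, single-step, and multi-step retrieval. The function names, thresholds, and the `llm`/`retriever` interfaces are illustrative assumptions, not taken from any specific paper.

```python
# Minimal sketch of complexity-based adaptive retrieval (illustrative only;
# names, thresholds, and the llm/retriever interfaces are hypothetical).

def decide_retrieval(complexity_score: float) -> str:
    """Pick a retrieval strategy from a rough query-complexity estimate."""
    if complexity_score < 0.3:
        return "no_retrieval"   # model likely answers from parametric knowledge
    elif complexity_score < 0.7:
        return "single_step"    # one retrieval pass over the corpus
    return "multi_step"         # iterative retrieval for multi-hop questions


def answer(query: str, llm, retriever, complexity_score: float) -> str:
    strategy = decide_retrieval(complexity_score)
    if strategy == "no_retrieval":
        return llm.generate(query)
    docs = retriever.search(query, top_k=5)
    if strategy == "multi_step":
        # second retrieval pass conditioned on a first-pass draft answer
        draft = llm.generate(query, context=docs)
        docs += retriever.search(draft, top_k=5)
    return llm.generate(query, context=docs)
```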
Researchers from KAIST and DeepAuto.ai present UniversalRAG, a RAG framework that retrieves and integrates knowledge from modality-specific sources (text, image, video) at multiple granularity levels. Unlike approaches that embed all modalities into a shared space, which leads to modality bias, UniversalRAG uses a modality-aware routing mechanism to dynamically select the most relevant corpus for each query. It further improves retrieval precision by organizing each modality into granularity-specific corpora, such as paragraphs or video clips. Validated on eight multimodal benchmarks, UniversalRAG consistently outperforms unified and modality-specific baselines, demonstrating its adaptability to diverse query needs.
UniversalRAG is a retrieval-augmented generation framework that handles queries across modalities and data granularities. Unlike standard RAG models limited to a single corpus, UniversalRAG separates knowledge into text, image, and video corpora, each with fine- and coarse-grained levels. A routing module first determines the optimal modality and granularity for a given query, choosing among options such as paragraphs, full documents, video clips, or full videos, and retrieves relevant information accordingly. The router can be either a training-free LLM-based classifier or a trained model using heuristic labels derived from benchmark datasets. An LVLM then uses the selected content to generate the final response.
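The sketch below illustrates what such a routing pipeline could look like in code, using a training-free LLM-based router. The route names mirror the options described above, but the router prompt, corpus objects, and model wrappers are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of modality- and granularity-aware routing (illustrative;
# the prompt, corpus interfaces, and model wrappers are assumptions).

ROUTES = ["none", "paragraph", "document", "image", "clip", "video"]

def route(query: str, router_llm) -> str:
    """Training-free router: ask an LLM to classify the query into one route."""
    prompt = (
        "Given the question below, choose the single best knowledge source "
        f"from {ROUTES} and answer with that word only.\n\nQuestion: {query}"
    )
    choice = router_llm.generate(prompt).strip().lower()
    return choice if choice in ROUTES else "paragraph"  # fall back to text

def universal_rag_answer(query: str, router_llm, corpora: dict, lvlm) -> str:
    """Route, retrieve from the selected modality/granularity corpus, generate."""
    choice = route(query, router_llm)
    if choice == "none":
        return lvlm.generate(query)                      # no retrieval needed
    retrieved = corpora[choice].search(query, top_k=3)   # modality-specific retriever
    return lvlm.generate(query, context=retrieved)       # LVLM grounds on retrieved items
```

Keeping one retriever per modality-and-granularity corpus, rather than one shared embedding space, is what lets the router sidestep modality bias: each query only competes against candidates of the type the router selected.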
The experimental setup assesses UniversalRAG across six retrieval scenarios: no retrieval, paragraph, document, image, clip, and video. For the no-retrieval setting, MMLU tests general knowledge. Paragraph-level tasks use SQuAD and Natural Questions, while HotpotQA covers multi-hop document retrieval. Image-based queries come from WebQA, and video-related ones are sourced from the LVBench and VideoRAG datasets, split into clip- and full-video levels. Corresponding retrieval corpora are curated for each modality: Wikipedia-based for text, WebQA for images, and YouTube videos for video tasks. This comprehensive benchmark ensures robust evaluation across varied modalities and retrieval granularities.

In conclusion, UniversalRAG is a retrieval-augmented generation framework that can retrieve knowledge from multiple modalities and levels of granularity. Unlike existing RAG methods that rely on a single, often text-only, corpus or a single-modality source, UniversalRAG dynamically routes queries to the most appropriate modality- and granularity-specific corpus. This approach addresses issues such as modality gaps and rigid retrieval structures. Evaluated on eight multimodal benchmarks, UniversalRAG outperforms both unified and modality-specific baselines. The study also emphasizes the benefits of fine-grained retrieval and highlights how both trained and training-free routing mechanisms contribute to robust, flexible multimodal reasoning.
Check out the Paper.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.