VideoMind: A Role-Based Agent for Temporal-Grounded Video Understanding

LLMs have shown impressive capabilities in reasoning techniques such as Chain-of-Thought (CoT), improving accuracy and interpretability in complex problem-solving. While researchers are extending these capabilities to multi-modal domains, videos present unique challenges due to their temporal dimension. Unlike static images, videos require understanding dynamic interactions over time. Current visual CoT methods excel with static inputs but struggle with video content because they cannot explicitly localize or revisit specific moments in a sequence. Humans overcome these challenges by breaking down complex problems, identifying and revisiting key moments, and synthesizing their observations into coherent answers. This approach highlights the need for AI systems that can manage multiple reasoning abilities.

Recent advances in video understanding have improved tasks such as captioning and question answering, but models often lack visually grounded correspondence and interpretability, particularly for long-form videos. Video temporal grounding addresses this by requiring precise moment localization. Large multimodal models trained with supervised instruction tuning still struggle with complex reasoning tasks. Two major approaches have emerged to address these limitations: agent-based interfaces and pure text-based reasoning paradigms exemplified by CoT processes. Moreover, inference-time search techniques have proven valuable in domains such as robotics, games, and navigation by allowing models to iteratively refine outputs without changing the underlying weights.

Researchers from The Hong Kong Polytechnic University and Show Lab, National University of Singapore, have proposed VideoMind, a video-language agent designed for temporal-grounded video understanding. VideoMind introduces two key innovations to address the challenges of video reasoning. First, it identifies the essential capabilities for video temporal reasoning and implements a role-based agentic workflow with specialized components: a planner, a grounder, a verifier, and an answerer. Second, it proposes a Chain-of-LoRA strategy that enables seamless role-switching through lightweight LoRA adapters, avoiding the overhead of running multiple models while balancing efficiency and flexibility. Experiments across 14 public benchmarks show state-of-the-art performance on diverse video understanding tasks.
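To make the Chain-of-LoRA idea concrete, the sketch below shows how a single shared backbone could host several named LoRA adapters and swap between them, using the Hugging Face PEFT library. The adapter checkpoint paths are hypothetical placeholders, and this is only an illustrative sketch under those assumptions, not the authors' released code.

```python
# Illustrative sketch: one backbone, multiple role adapters (Chain-of-LoRA-style).
# Adapter paths are hypothetical placeholders, not VideoMind release artifacts.
from transformers import Qwen2VLForConditionalGeneration
from peft import PeftModel

base = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Attach one LoRA adapter per role on top of the shared backbone.
agent = PeftModel.from_pretrained(base, "path/to/planner_lora", adapter_name="planner")
agent.load_adapter("path/to/grounder_lora", adapter_name="grounder")
agent.load_adapter("path/to/verifier_lora", adapter_name="verifier")
agent.load_adapter("path/to/answerer_lora", adapter_name="answerer")

# Switching roles at inference time is just an adapter swap; the backbone stays
# in memory once instead of loading four separate models.
agent.set_adapter("grounder")
```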

VideoMind builds upon Qwen2-VL, combining an LLM backbone with a ViT-based visual encoder capable of handling dynamic-resolution inputs. Its core innovation is the Chain-of-LoRA strategy, which dynamically activates role-specific LoRA adapters during inference via self-calling. Moreover, it contains four specialized components: (a) the Planner, which coordinates all other roles and determines which function to call next based on the query; (b) the Grounder, which localizes relevant moments by identifying start and end timestamps from text queries; (c) the Verifier, which provides binary ("Yes"/"No") responses to validate temporal intervals; and (d) the Answerer, which generates responses based on either the cropped video segment identified by the Grounder or the entire video when direct answering is more appropriate.
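The interplay of these four roles can be summarized in the minimal control-flow sketch below. The helper functions and hard-coded timestamps are stand-ins for calls into the role-specific adapters, assumed purely for illustration; they are not part of the VideoMind codebase.

```python
# Minimal control-flow sketch of the role-based workflow
# (planner -> grounder -> verifier -> answerer). All helpers are stubs.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds

def plan(question: str) -> list[str]:
    # Planner: decide which roles are needed for this query.
    return ["grounder", "verifier", "answerer"]

def ground(video: str, query: str) -> Segment:
    # Grounder: localize the moment relevant to the query (stubbed timestamps).
    return Segment(start=12.0, end=27.5)

def verify(video: str, segment: Segment, query: str) -> bool:
    # Verifier: binary check that the proposed interval supports the query.
    return True

def answer(video: str, segment: Segment | None, question: str) -> str:
    # Answerer: respond from the cropped segment if one was verified, else the full video.
    scope = f"{segment.start:.1f}-{segment.end:.1f}s" if segment else "full video"
    return f"Answer derived from {scope}."

def run_agent(video: str, question: str) -> str:
    roles = plan(question)
    segment = None
    if "grounder" in roles:
        candidate = ground(video, question)
        if "verifier" not in roles or verify(video, candidate, question):
            segment = candidate
    return answer(video, segment, question)

print(run_agent("demo.mp4", "When does the dog catch the frisbee?"))
```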

On grounding metrics, VideoMind's lightweight 2B model outperforms most compared models, including InternVL2-78B and Claude-3.5-Sonnet, with only GPT-4o showing superior results. However, the 7B version of VideoMind surpasses even GPT-4o, achieving competitive overall performance. On the NExT-GQA benchmark, the 2B model matches state-of-the-art 7B models across both agent-based and end-to-end approaches, comparing favorably with text-rich, agent-based solutions such as LLoVi, LangRepo, and SeViLA. VideoMind shows exceptional zero-shot capabilities, outperforming all LLM-based temporal grounding methods and achieving competitive results against fine-tuned temporal grounding experts. Moreover, VideoMind excels in general video QA tasks across Video-MME (Long), MLVU, and LVBench, showing effective localization of cue segments before answering questions.

In this paper, the researchers introduced VideoMind, a significant advancement in temporal-grounded video reasoning. It addresses the complex challenges of video understanding through an agentic workflow that combines a Planner, Grounder, Verifier, and Answerer with an efficient Chain-of-LoRA strategy for role-switching. Experiments across three key domains, grounded video question-answering, video temporal grounding, and general video question-answering, confirm VideoMind's effectiveness for long-form video reasoning tasks, where it provides precise, evidence-based answers. This work establishes a foundation for future developments in multimodal video agents and reasoning capabilities, opening new pathways toward more capable video understanding systems.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.

Sajjad Ansari is a final-year undergraduate from IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
