ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

Video captioning models are typically trained on datasets consisting of short videos, usually under three minutes in length, paired with corresponding captions. While this enables them to describe basic actions like walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and movies that can last over an hour. When applied to such videos, they often generate fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to a shortage of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, yet this approach is costly and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and much easier to work with.

Advancements in visual-language models have significantly enhanced the integration of vision and language tasks, with early works such as CLIP and ALIGN laying the foundation. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them for video understanding by focusing on temporal sequence modeling and constructing more robust datasets. Despite these developments, the scarcity of large, annotated long-form video datasets remains a significant hindrance to progress. Traditional short-form video tasks, like video question answering, captioning, and grounding, primarily require spatial or temporal understanding, whereas summarizing hour-long videos demands identifying key frames amid significant redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization tasks due to data limitations.

Researchers from Queen Mary University and Spotify introduce ViSMaP, an unsupervised method for summarising hour-long videos without requiring costly annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where important events are scattered. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions created by short-form video models. The process involves three LLMs working in sequence for generation, evaluation, and prompt optimisation. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while maintaining domain adaptability and eliminating the need for extensive manual labelling.
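Below is a minimal Python sketch of what such a generate-evaluate-optimise loop could look like. The `call_llm` helper, the prompt texts, and the numeric scoring convention are illustrative assumptions, not the paper's actual prompts or interfaces.

```python
# Minimal sketch of a three-LLM meta-prompting loop (generator, evaluator, optimizer).
# `call_llm` stands in for any chat-completion API; prompts and the 0-10 scoring
# convention are assumptions for illustration only.

def call_llm(role_prompt: str, user_input: str) -> str:
    """Placeholder for a chat-completion call (e.g. a hosted or local LLM)."""
    raise NotImplementedError

def meta_prompt_summarize(clip_captions: list[str], n_rounds: int = 3) -> str:
    context = "\n".join(clip_captions)  # pseudo-captions of the 3-minute clips
    gen_prompt = "Summarize the video described by these clip captions."
    best_summary, best_score = "", float("-inf")

    for _ in range(n_rounds):
        # 1) Generator: produce a candidate pseudo-summary from the clip captions.
        summary = call_llm(gen_prompt, context)

        # 2) Evaluator: score the candidate (assumes the reply starts with a number).
        feedback = call_llm(
            "Rate this summary's coverage and coherence from 0 to 10, then explain.",
            f"Captions:\n{context}\n\nSummary:\n{summary}",
        )
        score = float(feedback.split()[0])

        if score > best_score:
            best_summary, best_score = summary, score

        # 3) Optimizer: rewrite the generator's prompt using the evaluator's feedback.
        gen_prompt = call_llm(
            "Improve the summarization prompt based on this feedback.",
            f"Current prompt:\n{gen_prompt}\n\nFeedback:\n{feedback}",
        )
    return best_summary
```

The best-scoring summary across rounds is kept as the pseudo-summary used for training in the next stage.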

The study addresses cross-domain video summarization by training on a labelled short-form video dataset and adapting to unlabelled, hour-long videos from a different domain. Initially, a model is trained to summarize 3-minute videos using TimeSformer features, a visual-language alignment module, and a text decoder, optimized with cross-entropy and contrastive losses. To handle longer videos, they are segmented into 3-minute clips, and pseudo-captions are generated for each clip. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) then refines these into pseudo-summaries. Finally, the model is fine-tuned on the pseudo-summaries using a symmetric cross-entropy loss to manage noisy labels and improve adaptation.
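For the noisy-label fine-tuning step, symmetric cross-entropy combines standard cross-entropy with a reverse term that is more tolerant of incorrect labels. The PyTorch sketch below follows the common SCE formulation; the weighting coefficients and the clamp values are assumptions, not figures reported for ViSMaP.

```python
# Sketch of symmetric cross-entropy (SCE) over token logits and pseudo-label ids.
# alpha/beta and the 1e-4 clamp follow the usual SCE recipe, assumed here.
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=0.1, beta=1.0):
    """logits: (N, C) class logits; targets: (N,) pseudo-label ids."""
    num_classes = logits.size(-1)

    # Forward cross-entropy: treats the (noisy) pseudo-labels as ground truth.
    ce = F.cross_entropy(logits, targets)

    # Reverse cross-entropy: treats the model's prediction as the reference,
    # which dampens the gradient from labels the model confidently rejects.
    pred = F.softmax(logits, dim=-1).clamp(min=1e-7)
    one_hot = F.one_hot(targets, num_classes).float().clamp(min=1e-4)  # avoid log(0)
    rce = (-pred * one_hot.log()).sum(dim=-1).mean()

    return alpha * ce + beta * rce
```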

The study evaluates ViSMaP across three scenarios: summarization of long videos using Ego4D-HCap, cross-domain generalization on the MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos using EgoSchema. ViSMaP, trained on hour-long videos, is compared against supervised and zero-shot methods, such as Video ReCap and LaViLa+GPT3.5, demonstrating competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, and METEOR scores, as well as QA accuracy. Ablation studies highlight the benefits of meta-prompting and of component modules such as contrastive learning and the SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training conducted on an NVIDIA A100 GPU.
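As an illustration of how such caption metrics are typically computed, the snippet below scores generated summaries against references with ROUGE-L using the `rouge-score` package; CIDEr and METEOR would usually come from a separate toolkit such as pycocoevalcap, and this is not the paper's evaluation code.

```python
# Illustrative ROUGE-L evaluation over parallel lists of reference and
# generated summaries (pip install rouge-score).
from rouge_score import rouge_scorer

def mean_rouge_l(references: list[str], predictions: list[str]) -> float:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ]
    return sum(scores) / len(scores)

# Example (hypothetical strings):
# mean_rouge_l(["a man cooks pasta then serves dinner"],
#              ["the man prepares pasta and serves it"])
```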

In conclusion, ViSMaP is an unsupervised approach for summarizing long videos that uses annotated short-video datasets and a meta-prompting strategy. It first creates high-quality summaries through meta-prompting and then trains a summarization model, reducing the need for extensive annotations. Experimental results show that ViSMaP performs on par with fully supervised methods and adapts effectively across various video datasets. However, its reliance on pseudo-labels from a source-domain model may affect performance under significant domain shifts. Additionally, ViSMaP currently relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.


Check out the Paper.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
