LongViTU: Instruction Tuning for Long-Form Video Understanding

26 Sept 2024 (modified: 13 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: vision language models, instruction-tuning, long-form video understanding
TL;DR: We propose a large-scale instruction-tuning dataset for long-form video understanding.
Abstract: This paper presents LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. Our key idea is inspired by the success of Large Language Models (LLMs) and Multimodal Language Models (MLMs) that are fueled by machine-generated instruction-following data (*e.g.*, InstructGPT, LLaVA). We develop a *systematic* approach to produce massive numbers of question-answering pairs tailored to virtually unbounded long videos by organizing them into a ***hierarchical tree***, incorporating ***self-revision*** mechanisms to guarantee high quality. We curate LongViTU so that each QA pair: 1) involves a long context (average *certificate length* of 4.6 minutes); 2) requires rich knowledge and condensed reasoning (commonsense, causality, planning, *etc.*); 3) explicitly labels the timestamps of relevant events throughout the entire video. Furthermore, LongViTU provides a benchmark to facilitate future research in instruction-following for long-form videos. Our experiments first reveal the performance gap between open-source video MLMs and their commercial counterparts (*e.g.*, Gemini-1.5-Pro) on this benchmark. After Supervised Fine-Tuning (SFT), Video-LLaVA achieves the best performance among open-source models, with a GPT-4 score of $50.7$, closely trailing the $52.3$ of the leading closed-source model Gemini-1.5-Pro and underscoring the substantial challenge posed by our benchmark. Further SFT of Video-LLaVA on LongViTU yields improvements of $30.7$% on the In-Distribution (ID) benchmark EgoSchema, and of $12.9$% and $0.6$% on the Out-of-Distribution (OOD) benchmarks WorldQA and VideoMME, respectively. These outcomes demonstrate the effectiveness and robust OOD generalizability of our proposed instruction-tuning scheme for long-form video understanding. The dataset, SFT models, and code are publicly available on the anonymous page [LongViTU](https://longvitu.github.io).
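
To make the abstract's pipeline concrete, below is a minimal, hypothetical Python sketch of the hierarchical-tree organization plus a self-revision pass for QA generation. All names (`VideoSegment`, `build_tree`, `generate_qa`, `self_revise`) and the stubbed `llm` call are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: organize clip captions into a hierarchical tree,
# then generate and self-revise a QA pair at a chosen node.
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoSegment:
    start: float                         # segment start time in seconds
    end: float                           # segment end time in seconds
    caption: str                         # short description of the clip
    children: List["VideoSegment"] = field(default_factory=list)


def build_tree(segments: List[VideoSegment], fanout: int = 4) -> VideoSegment:
    """Group consecutive clips into parent nodes until one root spans the whole video."""
    level = segments
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append(VideoSegment(
                start=group[0].start,
                end=group[-1].end,
                caption=" ".join(s.caption for s in group),  # placeholder summary
                children=group,
            ))
        level = parents
    return level[0]


def llm(prompt: str) -> str:
    """Stub for an LLM call; replace with a real API client in practice."""
    return f"[LLM output for prompt of {len(prompt)} chars]"


def generate_qa(node: VideoSegment) -> dict:
    """Draft a QA pair grounded in the node's time span (timestamps kept explicitly)."""
    question = llm(f"Write a reasoning question about: {node.caption}")
    answer = llm(f"Answer using only events between {node.start}s and {node.end}s")
    return {"start": node.start, "end": node.end, "question": question, "answer": answer}


def self_revise(qa: dict) -> dict:
    """Second pass: ask the LLM to check and rewrite the pair for grounding and clarity."""
    qa["answer"] = llm(f"Revise for correctness and timestamp grounding: {qa}")
    return qa


if __name__ == "__main__":
    clips = [VideoSegment(i * 30, (i + 1) * 30, f"clip {i}") for i in range(8)]
    root = build_tree(clips)
    print(self_revise(generate_qa(root)))
```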
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7604