LLaVA-Video: Video Instruction Tuning With Synthetic Data

TMLR Paper 4754 Authors

29 Apr 2025 (modified: 16 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach: creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset covers key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. Training on this dataset, in combination with existing visual instruction-tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
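To make the three task types concrete, below is a minimal sketch of what a single instruction-following sample in such a dataset might look like, assuming the LLaVA-style conversation format; the field names and example content are illustrative assumptions, not the released LLaVA-Video-178K schema.

```python
# Hypothetical illustration of one video instruction-tuning sample covering
# the three task types (detailed captioning, open-ended QA, multiple-choice QA).
# Field names and values are assumptions for illustration only.
sample = {
    "video": "videos/clip_000123.mp4",  # path to the source video clip
    "conversations": [
        # Detailed captioning task
        {"from": "human", "value": "<video>\nDescribe this video in detail."},
        {"from": "gpt",   "value": "A person places a pan on the stove and pours in oil..."},
        # Open-ended QA task
        {"from": "human", "value": "What does the person add after the oil heats up?"},
        {"from": "gpt",   "value": "They add chopped onions to the pan."},
        # Multiple-choice QA task
        {"from": "human", "value": "What is cooked last?\nA. Onions\nB. Eggs\nC. Rice\nD. Peppers"},
        {"from": "gpt",   "value": "B"},
    ],
}
```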
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Boqing_Gong1
Submission Number: 4754