LLaVA-Video: Video Instruction Tuning With Synthetic Data

TMLR Paper 4754 Authors

29 Apr 2025 (modified: 16 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach: creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset covers key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. Training on this dataset, in combination with existing visual instruction-tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
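To make the three task types concrete, below is a minimal sketch of what a single instruction-following sample in such a dataset might look like, assuming the LLaVA-style conversation format; the field names and example content are illustrative assumptions, not the released LLaVA-Video-178K schema.

```python
# Hypothetical illustration of one video instruction-tuning sample covering
# the three task types (detailed captioning, open-ended QA, multiple-choice QA).
# Field names and values are assumptions for illustration only.
sample = {
    "video": "videos/clip_000123.mp4",  # path to the source video clip
    "conversations": [
        # Detailed captioning task
        {"from": "human", "value": "<video>\nDescribe this video in detail."},
        {"from": "gpt",   "value": "A person places a pan on the stove and pours in oil..."},
        # Open-ended QA task
        {"from": "human", "value": "What does the person add after the oil heats up?"},
        {"from": "gpt",   "value": "They add chopped onions to the pan."},
        # Multiple-choice QA task
        {"from": "human", "value": "What is cooked last?\nA. Onions\nB. Eggs\nC. Rice\nD. Peppers"},
        {"from": "gpt",   "value": "B"},
    ],
}
```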
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Boqing_Gong1
Submission Number: 4754