Keywords: Video Understanding and Generation, Dataset and Benchmark, Multimodal, Music
Abstract: Integrating multimodal understanding and generation into a unified framework can bridge the domain gap across different modalities.
However, existing multimodal-language datasets predominantly offer text descriptions for a single modality, treating the visual and audio streams as separate tasks. This approach neglects the inherent audio-visual correlations, yielding annotations that are monotonous and modality-specific rather than comprehensive and precise, which in turn hampers cross-modality research. To fill this gap, we present ViML, a large-scale multi-modality-to-language dataset comprising 3M video clips with high-quality multimodal captions.
For ViML, we propose a systematic captioning framework that produces annotations for multiple modalities across more than 12.2k hours of trailer videos. To ensure that the final caption retains the musical perspective while keeping the visual context authoritative, we leverage an advanced LLM to merge all modality-specific annotations adaptively. In particular, ViML has two main advantages:
(1) the topics are diverse, and the content spans various genres, e.g., film, news, and gaming.
(2) the corresponding background music is custom-designed, making it more coherent with the visual context.
In this fashion, our ViML dataset potentially paves the way for fine-grained large multimodal-language model training. In experiments, we provide evaluation metrics and benchmark results on our dataset, demonstrating the high quality of our annotations and their effectiveness for model training. Demo data is available at https://anonymous.4open.science/w/ViML-4C78.
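To make the adaptive merging step concrete, here is a minimal Python sketch. It is illustrative only: the MERGE_PROMPT wording, the llm_complete helper, and the toy captions are hypothetical placeholders standing in for the actual captioners and LLM used in the ViML pipeline, which the abstract does not specify.

```python
# Hypothetical sketch of the caption-merging step described in the abstract:
# per-modality captions are fused by an instruction-following LLM that keeps
# the musical perspective but treats the visual caption as authoritative.
# `llm_complete` is a placeholder, not a real API.

MERGE_PROMPT = (
    "Merge the following captions of one video clip into a single coherent "
    "description. Keep the musical perspective, but when details conflict, "
    "trust the visual caption.\n\n"
    "Visual caption: {visual}\n"
    "Audio/speech caption: {audio}\n"
    "Music caption: {music}\n\n"
    "Merged caption:"
)

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an instruction-following LLM."""
    raise NotImplementedError("plug in an LLM client here")

def merge_captions(visual: str, audio: str, music: str) -> str:
    """Fuse per-modality annotations into one multimodal caption."""
    return llm_complete(MERGE_PROMPT.format(visual=visual, audio=audio, music=music))

# Toy usage (once a real LLM client is wired into llm_complete):
# merge_captions(
#     visual="A knight charges across a stormy battlefield at dusk.",
#     audio="A narrator announces the film's release date.",
#     music="Tense orchestral score with rising brass and percussion.",
# )
```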
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6844