Keywords: omni-multimodal large language models, identity-aware, video understanding
TL;DR: An end-to-end omni-multimodal large language model for ID-aware video understanding
Abstract: Movie understanding remains challenging: a movie involves many characters with complex relationships and is edited with artistic language to appeal to audiences, aspects that current multimodal large language models (MLLMs) neglect. Only a few previous works identify characters and integrate identity (ID) information into models, and they either rely on cascaded models or use only vision and scripts while ignoring audio. To address these problems, we propose an all-in-one Omni-MLLM with built-in capabilities for ID identification, shot-level description, and answering critical sub-questions during reasoning. First, we construct identity-related data covering 12 fine-grained character-centric tasks to improve the model's ability to identify characters from vision and audio. Second, we leverage frame and shot descriptions to ease training. Third, we explore further enhancing the model with Chain-of-Thought (CoT) data from an advanced model. Experimental results show that our model achieves consistent improvements on both the ID-aware movie understanding question set StoryQA and the general video understanding benchmark VideoMME. Ablation studies confirm the positive contribution of each of our proposed components.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25627