Mulan: A Multi-Level Alignment Model for Video Question Answering

Yu Fu; Cong Cao; Yuling Yang; Yuhai Lu; Fangfang Yuan; Dakui Wang; Yanbing Liu

Mulan: A Multi-Level Alignment Model for Video Question Answering

Yu Fu, Cong Cao, Yuling Yang, Yuhai Lu, Fangfang Yuan, Dakui Wang, Yanbing Liu

Published: 07 Oct 2023, Last Modified: 01 Dec 2023EMNLP 2023 FindingsEveryoneRevisionsBibTeX

Submission Type: Regular Long Paper

Submission Track: Question Answering

Submission Track 2: Speech and Multimodality

Keywords: Video Question Answering, Multi-Level Alignment, Contrastive Learning

Abstract: Video Question Answering (VideoQA) aims to answer questions about the visual content of a video. Current methods mainly focus on improving joint representations of video and text. However, these methods pay little attention to the fine-grained semantic interaction between video and text. In this paper, we propose Mulan: a Multi-Level Alignment Model for Video Question Answering, which establishes alignment between visual and textual modalities at the object-level, frame-level, and video-level. Specifically, for object-level alignment, we propose a mask-guided visual feature encoding method and a visual-guided text description method to learn fine-grained spatial information. For frame-level alignment, we introduce the use of visual features from individual frames, combined with a caption generator, to learn overall spatial information within the scene. For video-level alignment, we propose an expandable ordinal prompt for textual descriptions, combined with visual features, to learn temporal information. Experimental results show that our method outperforms the state-of-the-art methods, even when utilizing the smallest amount of extra visual-language pre-training data and a reduced number of trainable parameters.

Submission Number: 3375

Loading