Are VideoQA Models Truly Multimodal?

Published: 27 Oct 2023, Last Modified: 21 Nov 2023, NeurIPS XAIA 2023
Abstract: While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models jointly capture and leverage the rich multimodal structures and dynamics from video and text? Or are they merely exploiting shortcuts to achieve high scores? To answer this, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight and non-parametric probe, to critically analyze multimodal representations. QUAG facilitates combined dataset-model studies by systematically ablating a model's coupled multimodal understanding during inference. Surprisingly, it shows that the models maintain high performance even under multimodal impairment. This indicates that current VideoQA benchmarks and metrics do not penalize models that find shortcuts and discount joint multimodal understanding. Motivated by this, we propose $\textit{CLAVI}$ (Counterfactual in LAnguage and VIdeo), a diagnostic dataset for coupled multimodal understanding in VideoQA. CLAVI consists of temporal questions and videos that are augmented to curate balanced counterfactuals in the language and video domains. We evaluate models on CLAVI and find that all models achieve high performance on multimodal shortcut instances, but most of them perform very poorly on the counterfactual instances that necessitate joint multimodal understanding. Overall, we show that many VideoQA models are incapable of learning multimodal representations and that their success on standard datasets is an illusion of joint multimodal understanding.
Submission Track: Full Paper Track
Application Domain: None of the above / Not applicable
Clarify Domain: Multimodal (vision-language)
Survey Question 1: Our work examines whether VideoQA -- a seemingly multimodal task (that is, one requiring the AI model to understand both the video and the question) -- is truly multimodal. That is, do Transformer models actually learn to leverage the information within and between the modalities, or do they mostly rely on shortcuts? The first part of our paper is based on explainability. We discuss QUAG, a simple approach for performing *combined dataset-model analysis* *without training* to reveal the failure modes of multimodal representations in these models. Based on the insights from QUAG, we then develop CLAVI, a litmus-test dataset for joint multimodal understanding on which most of the models fail.
Survey Question 2: The VideoQA community has long relied on baselines that are not informative (for example, language-only baselines for checking language bias, shuffling video frames for checking temporal bias, etc.). However, these approaches are too specific and brittle: they offer little insight into *why* and *how* the model is failing (or whether the model is merely creating an illusion of "success"). We solve these problems using QUAG and CLAVI.
Survey Question 3: We develop our own analysis method: QUAG. QUAG is lightweight (only a couple of lines added to the self-attention code; a hedged sketch of such a modification is shown below), non-parametric, and intuitive; it operates at inference time and pinpoints the exact failure point of a model. It is modality-agnostic and can be extended to any multimodal problem. We validate the findings from QUAG using our diagnostic dataset, CLAVI.
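As a rough illustration of what such a modification might look like, the sketch below is an assumption-laden rendering, not the paper's actual code. It assumes a post-softmax self-attention matrix over a joint video-text token sequence in which the first `len_video` tokens are video tokens, and it replaces one modality quadrant of that matrix with its row-wise average at inference time. The helper name `quadrant_average`, the token ordering, and the quadrant labels are illustrative choices.

```python
import torch


def quadrant_average(attn: torch.Tensor, len_video: int,
                     quadrant: str = "text_to_video") -> torch.Tensor:
    """Sketch of a QUAG-style quadrant-averaging probe (illustrative only).

    attn:      post-softmax attention weights of shape (batch, heads, N, N),
               where the first `len_video` tokens are assumed to be video
               tokens and the remaining tokens are text tokens.
    quadrant:  which intra- or cross-modal block to impair.
    """
    v = slice(0, len_video)              # video token positions
    t = slice(len_video, attn.size(-1))  # text token positions
    blocks = {
        "video_to_video": (v, v),
        "video_to_text":  (v, t),
        "text_to_video":  (t, v),
        "text_to_text":   (t, t),
    }
    q_idx, k_idx = blocks[quadrant]
    out = attn.clone()
    block = out[..., q_idx, k_idx]
    # Replace the block with its row-wise mean: each query keeps the same total
    # attention mass on that modality but loses token-level distinctions.
    out[..., q_idx, k_idx] = block.mean(dim=-1, keepdim=True).expand_as(block)
    return out


# The "couple of lines" inside a self-attention forward pass might look like:
#   attn = torch.softmax(scores, dim=-1)
#   attn = quadrant_average(attn, len_video=num_video_tokens,
#                           quadrant="text_to_video")
#   out  = attn @ values
```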
Submission Number: 29