A Video Is Not Worth a Thousand Words

17 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: feature attribution, shapley values, modality preference, visual question answering, video understanding
TL;DR: Using Shapley values to determine modality contributions in multiple-choice VQA with long-context video.
Abstract: As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, there is a significant amount of research devoted to increasing both the difficulty of video question answering (VQA) datasets and the context lengths of the models that they evaluate. The reliance on large language models as backbones has led to concerns about potential text dominance, and the exploration of interactions between modalities is underdeveloped. How do we measure whether we're heading in the right direction, with the complexity that multi-modal models introduce? We propose a joint method of computing both feature attributions and modality scores based on Shapley values, where both the features and modalities are arbitrarily definable. Using these metrics, we compare $6$ VLMs of varying context lengths on $4$ representative datasets, focusing on multiple-choice VQA. In particular, we consider video frames and whole textual elements as equal features in the hierarchy, and the multiple-choice VQA task as an interaction between three modalities: video, question and answer. Our results demonstrate a dependence on text and show that the multiple-choice VQA task devolves into a model's ability to ignore distractors.
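To make the idea of joint feature attributions and modality scores concrete, the following is a minimal sketch of exact Shapley values over a handful of features, followed by a per-modality aggregation. It is an illustration under stated assumptions, not the submission's implementation: the function names, the toy value function standing in for a VLM's score on the correct answer, and the choice to aggregate by summing attributions within each modality are all hypothetical. Exact enumeration is exponential in the number of features, so a real long-video setting would presumably need a sampling approximation.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating every coalition (small n only)."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            # Standard Shapley weight for a coalition of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[f] += weight * (value_fn(s | {f}) - value_fn(s))
    return phi


def modality_scores(phi, modality_of):
    """Assumed aggregation: sum the attributions of each modality's features."""
    scores = {}
    for f, v in phi.items():
        m = modality_of[f]
        scores[m] = scores.get(m, 0.0) + v
    return scores


# Hypothetical stand-in for a model's score on the correct answer when only
# the features in `coalition` are present (the rest masked out).
def toy_value(coalition):
    score = 0.0
    if "question" in coalition:
        score += 0.3
    if "answers" in coalition:
        score += 0.4
    # One frame helps only together with the question (an interaction).
    if "question" in coalition and "frame_0" in coalition:
        score += 0.2
    return score  # "frame_1" never changes the score


# Frames and whole textual elements are treated as features on equal footing.
FEATURES = ["frame_0", "frame_1", "question", "answers"]
MODALITY_OF = {
    "frame_0": "video",
    "frame_1": "video",
    "question": "question",
    "answers": "answer",
}

phi = shapley_values(FEATURES, toy_value)
scores = modality_scores(phi, MODALITY_OF)
print(phi)
print(scores)
```

In this toy setup the attributions sum to the gap between the full-input score and the empty-input score (0.9), and the interaction term is split evenly between the question and the frame it depends on, so the video modality receives only a small share of the total.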
Primary Area: interpretability and explainable AI
Submission Number: 9007