MM-SHAP: A Performance-agnostic Metric for Interpreting Multimodal Contributions in Vision and Language Models & Tasks

Anonymous

03 Sept 2022 (modified: 05 May 2023) · ACL ARR 2022 September Blind Submission · Readers: Everyone
Abstract: Vision and language (VL) models are known to exploit non-robust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. A small drop in accuracy when a unimodal model is used on a VL task suggests that so-called unimodal collapse occurs. But how can we quantify the amount of unimodal collapse, i.e., how multimodal are VL models really? We present MM-SHAP, a performance-agnostic multimodality score that quantifies the proportion by which a model uses individual modalities in multimodal tasks. MM-SHAP is based on Shapley values and can be applied in two ways: (1) to compare models for their degree of multimodality, and (2) to measure the importance of individual modalities for a given task and dataset. Experiments with 6 VL models -- LXMERT, CLIP and four ALBEF variants -- on four VL tasks -- image-sentence alignment, Visual Question Answering, GQA and the more fine-grained VALSE benchmark -- highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided. We recommend MM-SHAP to complement accuracy metrics when analysing multimodal tasks, as it can help guide progress towards multimodal integration.
Paper Type: long
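
The abstract does not spell out how the score is computed. The following is a minimal sketch of one way a performance-agnostic multimodality share could be obtained from per-token Shapley values, assuming absolute-value aggregation and normalisation over both modalities; the names `mm_shap`, `text_shap` and `image_shap` are illustrative and not taken from the paper or its released code.

```python
# Sketch (assumed, not the authors' implementation): each modality's share of the
# prediction is its summed absolute Shapley attribution, normalised over both
# modalities, so the two shares add up to 1 regardless of task accuracy.

def mm_shap(text_shap, image_shap):
    """Return (text_share, image_share) from per-token Shapley values."""
    phi_t = sum(abs(v) for v in text_shap)   # total |contribution| of text tokens
    phi_i = sum(abs(v) for v in image_shap)  # total |contribution| of image patches
    total = phi_t + phi_i
    if total == 0:
        return 0.5, 0.5  # no attribution signal: treat modalities as balanced
    return phi_t / total, phi_i / total

# Example: a prediction driven mostly by the caption tokens
t_share, i_share = mm_shap([0.40, -0.30, 0.20], [0.05, -0.10])
print(f"textual share = {t_share:.2f}, visual share = {i_share:.2f}")  # 0.86, 0.14
```

A score near 1.0 for one modality would indicate the kind of unimodal collapse the abstract describes, while a balanced split suggests genuinely multimodal behaviour.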