Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose the first formal framework for characterizing the performance gap of multimodal LLMs under distribution shifts using information-theoretic metrics.
Abstract: Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.
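To make the abstract's core idea concrete, here is a minimal toy sketch of the kind of quantity involved: ordinary discrete mutual information between query and response labels, computed for an in-distribution and an out-of-distribution sample, with the drop in dependence serving as a stand-in for the EMI gap. This is not the paper's Effective Mutual Information (whose definition lives in the paper), and the sample data below are entirely hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plain discrete mutual information I(X;Y) in bits from (x, y) samples.
    Note: this is ordinary MI, not the paper's EMI, which is defined there."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # log2( p(x,y) / (p(x) p(y)) ), written with counts
        mi += p_xy * log2(p_xy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical samples: query category paired with a response label.
# In-distribution: responses track queries closely (strong dependence).
id_pairs = [("chart", "chart"), ("photo", "photo")] * 50
# Out-of-distribution: responses are independent of queries.
ood_pairs = [("chart", "chart"), ("chart", "photo"),
             ("photo", "chart"), ("photo", "photo")] * 25

gap = mutual_information(id_pairs) - mutual_information(ood_pairs)
print(round(gap, 3))  # → 1.0 bit: ID responses carry more query information
```

In the paper's framework the analogous gap is bounded in terms of visual and textual distributional discrepancies; here the positive `gap` simply illustrates a relevance metric degrading under a shift.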
Lay Summary: (1) Multimodal LLMs are proficient at handling queries similar to their instruction-tuning data but often struggle with unfamiliar ones, and there has been no formal framework to explain this performance gap. (2) We present the first formal framework to characterize and quantify the performance gap of multimodal LLMs under such query distribution shifts through the lens of information theory. (3) The proposed information-theoretic framework can be efficiently leveraged for reliable multimodal LLM evaluation in safety-critical real-world applications.
Link To Code: https://github.com/deeplearning-wisc/mllmshift-emi
Primary Area: Deep Learning->Large Language Models
Keywords: multimodal large language models, distribution shifts, robustness, trustworthy AI
Submission Number: 2952