Diagnosing the Curse: A Scale-Consistent and All-Phase Metric for Modality Bias in MLLMs

Published: 02 Mar 2026, Last Modified: 13 Mar 2026
Venue: ICLR 2026 Workshop MM Intelligence (Poster)
License: CC BY 4.0
Track: tiny paper (up to 4 pages)
Keywords: MLLMs, Modality Bias, Evaluation
Abstract: Quantifying modality bias in multimodal large language models (MLLMs) is key to diagnosing how these models reason across different input modalities. However, we identify two flaws in existing attention-based metrics: **the scaling paradox** and **the failure of layer-wise aggregation**. 1. As image resolution increases, the quadratic growth in the number of visual tokens induces denominator-driven drift in per-token attention metrics, causing standard metrics to spuriously report extreme text dominance. 2. The existing sparse, per-layer bias aggregation strategy masks the true distribution of modality bias across depth, so bias is mismeasured. To resolve these issues, we propose **Depth-wise Stratified Modality Dominance (DSMD)**. By conditioning attention analysis on input token-count quantiles, DSMD decouples reasoning preference from token counts. Furthermore, it incorporates an accuracy-weighted aggregation to pinpoint the layers driving correct predictions. Experiments on Qwen2.5-VL at resolutions from $112^2$ to $896^2$ demonstrate that DSMD eliminates the spurious divergence observed in baselines, correctly reflecting the saturation of visual benefit.
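The scaling paradox described above can be illustrated with a small numerical sketch. This is not the authors' code or metric; it only models the denominator effect, under the illustrative assumptions that the total attention mass placed on the image stays fixed while the number of visual tokens grows quadratically with resolution (the patch size, text-token count, and attention split below are hypothetical).

```python
# Hedged sketch of the scaling paradox: per-token attention metrics drift
# with resolution even when the model's actual attention split is constant.

def per_token_bias(resolution: int, patch: int = 28,
                   n_text: int = 50, visual_mass: float = 0.5) -> float:
    """Ratio of mean per-token text attention to mean per-token visual
    attention. Assumes the total attention mass on the image (visual_mass)
    is fixed, while visual tokens grow quadratically with resolution."""
    n_visual = (resolution // patch) ** 2      # quadratic visual-token growth
    text_mass = 1.0 - visual_mass              # remaining mass on text tokens
    per_tok_text = text_mass / n_text
    per_tok_visual = visual_mass / n_visual    # denominator-driven shrinkage
    return per_tok_text / per_tok_visual       # >1 reads as "text dominance"

for res in (112, 224, 448, 896):
    print(res, round(per_token_bias(res), 2))
```

Even though the assumed attention split is 50/50 at every resolution, the reported per-token ratio grows fourfold with each doubling of resolution, which is the spurious "extreme text dominance" the abstract refers to.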
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 8