Why Foundation Models Struggle with Cross-Modal Context

Published: 06 Mar 2025, Last Modified: 01 May 2025
Venue: SCSL @ ICLR 2025
License: CC BY 4.0
Track: regular paper (up to 6 pages)
Keywords: multilingual language models, multimodal language models, cross-modal reasoning
TL;DR: We find that cross-modal attention imbalance makes foundation models struggle with cross-modal contexts, and we propose a simple training-data mitigation.
Abstract: Foundation models (FMs) deployed in real-world tasks -- such as computer-use agents -- must integrate diverse modalities. How well do FMs reason jointly over information spread across multiple modalities? To better understand this problem, we study FMs on cross-modal conflicts: scenarios where conflicting evidence is presented across modalities. This allows us to examine whether FMs prioritize one modality over another or reason jointly to reconcile the conflict. Our experiments reveal that FMs can recognize conflicts in unimodal contexts 90% of the time, but this rate falls to as low as 3% when evidence is split across modalities -- similar observations hold in cross-lingual settings. We trace this failure to cross-modal attention imbalance, showing that FMs exhibit extreme asymmetry in attention scores, disproportionately prioritizing certain modalities. We show that cross-modal attention imbalance is not resolved by blindly scaling up multimodal or multilingual datasets, since they lack training examples that explicitly require cross-modal reasoning. We demonstrate that even a simple, scalable method that explicitly combines multiple modalities within each training instance significantly reduces attention imbalance. Our findings underscore the importance of systematically addressing cross-modal contexts to build reliable foundation models.
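The abstract does not include the measurement procedure, but below is a minimal sketch of one way such attention imbalance could be quantified: given per-layer attention weights and a per-token modality labeling, it computes the fraction of total attention mass each modality receives. This is an illustrative assumption, not the paper's released code; the function name `modality_attention_shares` and the `modality_ids` labeling scheme are hypothetical.

```python
import torch

def modality_attention_shares(attentions, modality_ids, n_modalities=2):
    """Fraction of total attention mass directed at tokens of each modality.

    A hypothetical diagnostic sketch, not the paper's exact method.

    attentions   -- iterable of per-layer tensors of shape
                    (batch, heads, query_len, key_len), e.g. the
                    `attentions` field of a Hugging Face model output
                    when called with output_attentions=True.
    modality_ids -- LongTensor of shape (batch, key_len) giving each
                    key token a modality label in [0, n_modalities).
    """
    shares = torch.zeros(n_modalities)
    for attn in attentions:  # loop over layers
        # Attention mass each key token receives, summed over heads
        # and query positions: shape (batch, key_len).
        mass = attn.sum(dim=(1, 2))
        for m in range(n_modalities):
            shares[m] += (mass * (modality_ids == m)).sum()
    return shares / shares.sum()  # normalize to fractions summing to 1
```

Under this reading, a highly skewed output (e.g., more than 0.9 of the attention mass on one modality) would correspond to the extreme asymmetry the abstract reports.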
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 25