Track: regular paper (up to 6 pages)
Keywords: Multimodal, In-context Learning
Abstract: The performance of Large Vision-Language Models (LVLMs) in In-Context Learning (ICL) depends heavily on the quality of ICL sequences, particularly in tasks requiring cross-modal reasoning and open-ended generation. To address this challenge, we interpret multimodal ICL through the lens of task mapping. We systematically model the local and global relationships within in-context demonstrations (ICDs) and show that their coherence plays a central role in enhancing LVLM performance. Building on these findings, we propose Ta-ICL, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By integrating task mapping into the autoregressive process, Ta-ICL achieves bidirectional enhancement between sequence configuration and task reasoning. Extensive experiments demonstrate that Ta-ICL consistently improves multimodal ICL across various LVLMs and tasks. Our results highlight the potential of task mapping to be widely applied in enhancing multimodal reasoning, paving the way for robust and generalizable multimodal ICL frameworks.
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Presenter: ~Yanshu_Li1
Submission Number: 66