Track: regular paper (up to 6 pages)
Keywords: Multimodal, In-context Learning, Large Vision-language Models
Abstract: The performance of Large Vision-Language Models (LVLMs) during in-context learning (ICL) is heavily influenced by shortcut learning, especially in tasks that demand robust multimodal reasoning and open-ended generation. To mitigate this, we introduce task mapping as a novel framework for analyzing shortcut learning and demonstrate that conventional in-context demonstration (ICD) selection methods can disrupt the coherence of task mappings. Building on these insights, we propose Ta-ICL, a task-aware model that enhances task mapping coherence through task-aware attention and autoregressive retrieval. Extensive experiments on diverse vision-language tasks show that Ta-ICL significantly reduces shortcut learning, improves reasoning consistency, and boosts LVLM adaptability. These findings underscore the potential of task mapping as a key strategy for refining multimodal reasoning, paving the way for more robust and generalizable ICL frameworks.
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as author names, affiliations, and identifying URLs.
Submission Number: 66