Abstract: The performance of Large Vision-Language Models (LVLMs) in In-Context Learning (ICL) depends heavily on the quality of ICL sequences, particularly in tasks requiring cross-modal reasoning and open-ended generation. To address this challenge, we interpret multimodal ICL from a novel perspective: task mapping. We systematically model the local and global relationships within in-context demonstrations (ICDs) and show that their cohesion plays a central role in enhancing LVLM performance. Building on these findings, we propose Ta-ICL, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By integrating task mapping into the autoregressive process, Ta-ICL achieves bidirectional enhancement between sequence configuration and task reasoning. Extensive experiments demonstrate that Ta-ICL consistently improves multimodal ICL across various LVLMs and tasks. Our results highlight the broad applicability of task mapping for enhancing multimodal reasoning, paving the way for robust and generalizable multimodal ICL frameworks.
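To make the idea of task-aware, autoregressive sequence configuration concrete, the following is a minimal sketch of how a lightweight transformer could score candidate ICDs conditioned on a learned task representation and a partially built sequence. It is an illustrative assumption only: the module and variable names (`TaskAwareICDSelector`, `task_query`, `cand_feats`) and the specific architecture are not taken from the paper, whose abstract does not specify implementation details.

```python
# Hypothetical sketch of task-aware attention for configuring an ICL sequence.
# Names and architecture are illustrative assumptions, not the paper's Ta-ICL.
import torch
import torch.nn as nn


class TaskAwareICDSelector(nn.Module):
    """Scores candidate in-context demonstrations (ICDs) against a task
    representation and extends the ICL sequence one demonstration at a time."""

    def __init__(self, dim: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.task_query = nn.Parameter(torch.randn(1, 1, dim))  # learned task token
        self.score_head = nn.Linear(dim, 1)

    def forward(self, query_feat, cand_feats, selected_feats):
        # query_feat:     (B, D)    fused image-text feature of the test query
        # cand_feats:     (B, N, D) features of candidate ICDs
        # selected_feats: (B, K, D) ICDs already placed in the sequence
        B = query_feat.size(0)
        task_tok = self.task_query.expand(B, -1, -1)
        # Jointly encode the task token, the query, and the partial sequence so
        # the task mapping conditions the choice of the next ICD (one AR step).
        ctx = torch.cat([task_tok, query_feat.unsqueeze(1), selected_feats], dim=1)
        ctx = self.encoder(ctx)
        task_state = ctx[:, 0]  # (B, D) updated task representation
        # Score each remaining candidate by its agreement with the task state.
        scores = self.score_head(cand_feats * task_state.unsqueeze(1)).squeeze(-1)
        return scores  # (B, N): pick the argmax, append it, and repeat


if __name__ == "__main__":
    selector = TaskAwareICDSelector(dim=512)
    q = torch.randn(2, 512)          # test-query features
    cands = torch.randn(2, 16, 512)  # 16 candidate demonstrations
    chosen = torch.randn(2, 3, 512)  # 3 ICDs already selected
    print(selector(q, cands, chosen).shape)  # torch.Size([2, 16])
```

In this sketch, repeating the forward pass after appending the highest-scoring candidate is what makes the configuration autoregressive: each new ICD updates the task state that guides the next selection.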
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal application, cross-modal content generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 8183