TAMP: Task-aware Multimodal Pre-Interaction for Fine-Grained Multimodal Large Language Models

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Fine-grained, Multimodal Large Language Model, Detector-free
Abstract: Current Multimodal Large Language Models (MLLMs) primarily rely on image-level visual-linguistic alignment, limiting their capability on fine-grained visual perception tasks. Existing solutions either serialize coordinates as text inputs, losing spatial semantics, or introduce specialized expert modules that increase inference latency and exhibit task bias. To address these limitations, we propose TAMP, a Task-aware Multimodal Pre-Interaction for fine-grained MLLMs, which automatically recognizes key task-relevant information in the instruction and extracts the corresponding region features through a unified, detector-free paradigm. We design a task-aware, dual-branch region connector that dynamically handles both referring and grounding tasks. By introducing an instruction template with region placeholders, we seamlessly integrate fine-grained region features into the LLM's reasoning process. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on both referring and grounding benchmarks while maintaining strong general VQA capabilities.
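To make the "instruction template with region placeholders" idea concrete, below is a minimal, hypothetical sketch (not the authors' released code): a `<region>` placeholder in the text instruction is replaced by a projected region feature produced by a toy dual-branch connector before the combined sequence is passed to the LLM. All names here (`REGION_TOKEN`, `RegionConnector`, `build_inputs`, the dimensions) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

REGION_TOKEN = "<region>"  # placeholder inserted into the text instruction


class RegionConnector(nn.Module):
    """Toy dual-branch connector: one branch for referring-style tasks,
    one for grounding-style tasks; both are plain projections here."""

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.refer_proj = nn.Linear(vis_dim, llm_dim)   # region feature -> LLM token space
        self.ground_proj = nn.Linear(vis_dim, llm_dim)  # branch used when the task predicts boxes

    def forward(self, region_feats: torch.Tensor, task: str) -> torch.Tensor:
        proj = self.refer_proj if task == "referring" else self.ground_proj
        return proj(region_feats)


def build_inputs(instruction, region_feats, text_embed, connector, task):
    """Split the instruction on REGION_TOKEN and interleave text-token embeddings
    with projected region embeddings, yielding one input sequence for the LLM."""
    chunks = instruction.split(REGION_TOKEN)
    pieces = []
    for i, chunk in enumerate(chunks):
        pieces.append(text_embed(chunk))  # embed the ordinary text span
        if i < len(region_feats):
            # splice in the region feature where the placeholder sat
            pieces.append(connector(region_feats[i:i + 1], task))
    return torch.cat(pieces, dim=0)


# Usage with dummy stand-ins for a tokenizer/embedder and a visual encoder.
vis_dim, llm_dim = 1024, 4096
connector = RegionConnector(vis_dim, llm_dim)
dummy_text_embed = lambda s: torch.randn(max(len(s.split()), 1), llm_dim)  # fake token embeddings
region_feats = torch.randn(1, vis_dim)                                     # one pooled region feature
seq = build_inputs(f"Describe {REGION_TOKEN} in the image.",
                   region_feats, dummy_text_embed, connector, task="referring")
print(seq.shape)  # (num_text_tokens + 1, llm_dim)
```

The sketch only shows the interface: in the actual system the region features would come from the task-aware pre-interaction rather than random tensors, and the resulting sequence would be fed to the LLM's decoder.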
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5916