Abstract: Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real-world scenarios, especially in scenes involving small objects or fine-grained spatial context. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce \SysName, a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. \SysName integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to optimally allocate tokens between global context and local details. Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that \SysName boosts accuracy by up to 27\% while cutting image token usage by up to 67\%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ying_Wei1
Submission Number: 5504
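The abstract's budget-aware strategy (splitting a fixed token budget between a global view and detail-preserving local regions) can be pictured with a minimal sketch. The paper does not disclose its implementation; the code below is an illustrative assumption, not the authors' method. All names (`allocate_budget`, `token_cost`, `Region`, the ViT-style per-patch cost, and the greedy crop selection) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """A prompt-relevant image region (pixel coordinates) with a relevance score."""
    x0: int
    y0: int
    x1: int
    y1: int
    score: float  # assumed relevance to the text prompt; higher = more important


def token_cost(width: int, height: int, patch: int = 14) -> int:
    """Approximate visual-token cost at a given resolution, assuming a
    ViT-style patchifier that emits one token per patch."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def allocate_budget(img_w: int, img_h: int, regions: list[Region],
                    budget: int, patch: int = 14,
                    global_fraction: float = 0.4) -> dict:
    """Greedily split a fixed token budget between a downsampled global view
    (preserving overall layout) and full-resolution crops of the most
    prompt-relevant regions (preserving local detail)."""
    # 1) Reserve a fraction of the budget for the global view and shrink the
    #    image until it fits within that reservation.
    global_budget = int(budget * global_fraction)
    scale = 1.0
    while token_cost(int(img_w * scale), int(img_h * scale), patch) > global_budget:
        scale *= 0.9
    global_cost = token_cost(int(img_w * scale), int(img_h * scale), patch)

    # 2) Spend the remainder on crops, most relevant regions first.
    remaining = budget - global_cost
    chosen = []
    for r in sorted(regions, key=lambda r: r.score, reverse=True):
        cost = token_cost(r.x1 - r.x0, r.y1 - r.y0, patch)
        if cost <= remaining:
            chosen.append(r)
            remaining -= cost

    return {"global_scale": scale, "global_cost": global_cost,
            "crops": chosen, "tokens_left": remaining}


if __name__ == "__main__":
    regions = [Region(100, 200, 240, 340, score=0.9),   # small, highly relevant object
               Region(800, 50, 1100, 400, score=0.4)]   # larger, less relevant area
    print(allocate_budget(1920, 1080, regions, budget=1500))
```

Under this sketch, the global view keeps spatial relationships intact at low resolution while the leftover budget buys full-resolution crops for the regions most relevant to the prompt, which is the trade-off the abstract attributes to \SysName.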