Abstract: Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real-world scenarios, especially in scenes involving small objects or fine-grained spatial context. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce \SysName, a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. \SysName integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to optimally allocate tokens between global context and local details. Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that \SysName boosts accuracy by up to 27\% while cutting image token usage by up to 67\%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ying_Wei1
Submission Number: 5504
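The abstract's budget-aware strategy (splitting a fixed token budget between a global view and detail-preserving local regions) can be pictured with a minimal sketch. The paper does not disclose its implementation; the code below is an illustrative assumption, not the authors' method. All names (`allocate_budget`, `token_cost`, `Region`, the ViT-style per-patch cost, and the greedy crop selection) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """A prompt-relevant image region (pixel coordinates) with a relevance score."""
    x0: int
    y0: int
    x1: int
    y1: int
    score: float  # assumed relevance to the text prompt; higher = more important


def token_cost(width: int, height: int, patch: int = 14) -> int:
    """Approximate visual-token cost at a given resolution, assuming a
    ViT-style patchifier that emits one token per patch."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def allocate_budget(img_w: int, img_h: int, regions: list[Region],
                    budget: int, patch: int = 14,
                    global_fraction: float = 0.4) -> dict:
    """Greedily split a fixed token budget between a downsampled global view
    (preserving overall layout) and full-resolution crops of the most
    prompt-relevant regions (preserving local detail)."""
    # 1) Reserve a fraction of the budget for the global view and shrink the
    #    image until it fits within that reservation.
    global_budget = int(budget * global_fraction)
    scale = 1.0
    while token_cost(int(img_w * scale), int(img_h * scale), patch) > global_budget:
        scale *= 0.9
    global_cost = token_cost(int(img_w * scale), int(img_h * scale), patch)

    # 2) Spend the remainder on crops, most relevant regions first.
    remaining = budget - global_cost
    chosen = []
    for r in sorted(regions, key=lambda r: r.score, reverse=True):
        cost = token_cost(r.x1 - r.x0, r.y1 - r.y0, patch)
        if cost <= remaining:
            chosen.append(r)
            remaining -= cost

    return {"global_scale": scale, "global_cost": global_cost,
            "crops": chosen, "tokens_left": remaining}


if __name__ == "__main__":
    regions = [Region(100, 200, 240, 340, score=0.9),   # small, highly relevant object
               Region(800, 50, 1100, 400, score=0.4)]   # larger, less relevant area
    print(allocate_budget(1920, 1080, regions, budget=1500))
```

Under this sketch, the global view keeps spatial relationships intact at low resolution while the leftover budget buys full-resolution crops for the regions most relevant to the prompt, which is the trade-off the abstract attributes to \SysName.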