Zoomer: Enhancing MLLM Performance with Adaptive Image Focus Optimization

Jiaxu Qian; Chendong Wang; Yifan Yang; Chaoyun Zhang; Huiqiang Jiang; Xufang Luo; Yu Kang; Qingwei Lin; Anlan Zhang; Shiqi Jiang; Ting Cao; Tianjun Mao; Suman Banerjee; Guyue Liu; Saravan Rajmohan; Dongmei Zhang; Yuqing Yang; Qi Zhang; Lili Qiu

26 Sept 2024 (modified: 15 Nov 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: Multimodal, MLLM, Prompt Engineering, Efficient, Token Compression

TL;DR: MLLMs struggle with precise object recognition. Zoomer improves MLLM performance by preserving visual details through dynamic image highlighting, spatial integrity, and efficient token use, boosting accuracy by up to 26.9% across datasets.

Abstract: Recent advances in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, and these models excel in applications such as image captioning and interactive question answering. However, they struggle to process visual data accurately, particularly in tasks that require precise object recognition and attention to fine visual details. Stringent token limits often force the omission of critical information, which hampers performance. To address these limitations, we introduce Zoomer, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. Zoomer features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that Zoomer consistently outperforms baseline methods, achieving up to a $26.9\%$ improvement in accuracy while significantly reducing token consumption.
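The abstract does not specify how the budget-aware prompting step trades global context against detail, so the following is only an illustrative sketch of one way such a trade-off could be made, not the authors' method. It assumes each candidate region already has a prompt-relevance score and a visual-token cost; the names `Region`, `select_regions`, `relevance`, `tokens`, and `global_tokens` are all hypothetical. The sketch reserves tokens for a downsampled global view, then greedily adds crops in order of relevance per token until the budget is exhausted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical candidate crop; none of these fields come from the paper."""
    name: str         # label for the cropped image region
    relevance: float  # assumed prompt-relevance score, higher is better
    tokens: int       # assumed visual-token cost of encoding the crop (> 0)


def select_regions(regions, budget, global_tokens):
    """Greedy sketch: keep a global view, then add crops by relevance per token.

    Returns the chosen regions and the total number of tokens used.
    """
    if global_tokens > budget:
        raise ValueError("budget is too small to hold the global view")

    remaining = budget - global_tokens  # tokens left after the global view
    chosen = []

    # Rank crops by relevance density so cheap, highly relevant crops come first.
    ranked = sorted(regions, key=lambda r: r.relevance / r.tokens, reverse=True)
    for region in ranked:
        if region.tokens <= remaining:
            chosen.append(region)
            remaining -= region.tokens

    return chosen, budget - remaining


if __name__ == "__main__":
    candidates = [
        Region("dog", relevance=0.9, tokens=300),
        Region("sign", relevance=0.6, tokens=100),
        Region("sky", relevance=0.1, tokens=400),
    ]
    picked, used = select_regions(candidates, budget=600, global_tokens=200)
    print([r.name for r in picked], used)
```

A greedy density ranking is a simple heuristic rather than an optimal knapsack solution; it is used here only because it makes the budget trade-off easy to see.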

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5385
