Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

Rahul Thapa; Kezhen Chen; Ian Connick Covert; Rahul Chalamala; Ben Athiwaratkun; Shuaiwen Leon Song; James Zou

28 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0

Keywords: Multimodal Language Model, Visual Instruction Tuning, Biomedical multimodal model, foundation model

TL;DR: Dragonfly surpasses existing vision transformers by zooming in beyond native image resolutions, excelling in fine-grained detail extraction and setting new benchmarks in general and biomedical tasks.

Abstract: Recent advancements in vision-language models (VLMs) have highlighted the benefits of processing images at higher resolutions and leveraging multi-crop features to retain native resolution details. However, current vision transformers (ViTs) often struggle to capture fine-grained details from non-dominant objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we push beyond the conventional high-resolution and multi-crop techniques by not only preserving but also zooming in past the native resolution of images. This enhancement allows our model to better extract fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we show that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. On average, across ten general-domain benchmarks, Dragonfly ranks at the top, outperforming models that are significantly larger or trained on much larger datasets. Notably, Dragonfly sets new benchmarks on several biomedical tasks, achieving 91.6\% accuracy on SLAKE (compared to 84.8\% for Med-Gemini) and a 67.1\% token F1 score on Path-VQA (compared to 62.7\% for Med-PaLM M). On biomedical image captioning tasks, Dragonfly attains state-of-the-art results on the majority of the performance metrics.
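
The abstract states that mean-pooling over tokens keeps the token count manageable, but it does not give the pooling window or token layout. The following is a minimal sketch of the general idea only, assuming the visual tokens of one crop form a 2D grid and are averaged over non-overlapping square windows. The function name `mean_pool_tokens`, the `window` parameter, and the toy grid are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def mean_pool_tokens(grid: List[List[Vector]], window: int) -> List[List[Vector]]:
    """Average non-overlapping window x window blocks of token vectors.

    Assumption (not from the paper): tokens are laid out as a 2D grid whose
    height and width are both divisible by `window`.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % window or cols % window:
        raise ValueError("grid dimensions must be divisible by window")
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, rows, window):
        pooled_row = []
        for c in range(0, cols, window):
            # Collect every token vector inside the current block.
            block = [grid[r + i][c + j] for i in range(window) for j in range(window)]
            # Element-wise mean across the block yields one pooled token.
            pooled_row.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
        pooled.append(pooled_row)
    return pooled


# Toy input: a 4x4 grid of 2-dimensional tokens (16 tokens in total).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
pooled = mean_pool_tokens(grid, window=2)  # 2x2 grid, i.e. 4 tokens
```

With `window=2`, each pooled token replaces four input tokens, so the sequence passed to the language model is four times shorter in this toy setting.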

Primary Area: foundation or frontier models, including LLMs

Submission Number: 12707
