Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

ICLR 2025 Conference Submission12707 Authors

28 Sept 2024 (modified: 21 Nov 2024)ICLR 2025 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Multimodel Language Model, Visual Instruction Tuning, Biomedical multimodal model, foundation model
TL;DR: Dragonfly surpasses existing vision transformers by zooming in beyond native image resolutions, excelling in fine-grained detail extraction and setting new benchmarks in general and biomedical tasks.
Abstract: Recent advancements in vision-language models (VLMs) have highlighted the benefits of processing images at higher resolutions and leveraging multi-crop features to retain native resolution details. However, current vision transformers (ViTs) often struggle to capture fine-grained details from non-dominant objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we push beyond the conventional high-resolution and multi-crop techniques by not only preserving but also zooming in past the native resolution of images. This enhancement allows our model to better extract fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we show that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. On average, across ten general-domain benchmarks, Dragonfly ranks at the top, outperforming models that are significantly larger or trained on much larger datasets. Notably, Dragonfly sets new benchmarks on several biomedical tasks, achieving 91.6\% accuracy on the SLAKE (compared to 84.8\% for Med-Gemini) and a 67.1\% token F1 score on Path-VQA (compared to 62.7\% for Med-PaLM M). On biomedical image captioning tasks, Dragonfly attains state-of-the-art results majority of the performance metrics.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12707
Loading