Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Published: 10 Jul 2024, Last Modified: 26 Aug 2024 · COLM · CC BY 4.0
Research Area: LMs on diverse modalities and novel applications
Keywords: Multimodal Large Language Model, Referring, Grounding
TL;DR: Ferret-v2 is a significant upgrade to Ferret that excels in detailed vision-related tasks without compromising its proficiency in global understanding.
Abstract: While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns richer and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
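To make the multi-granularity visual encoding concrete, below is a minimal PyTorch sketch of one plausible layout suggested by the abstract: a global path that encodes a downsampled full image with a CLIP-style encoder, and a local path that encodes high-resolution sub-image crops and concatenates CLIP-style and DINOv2-style patch features before projecting into the LLM embedding space. The `MultiGranularityEncoder` class, the convolutional stand-ins for the pretrained ViTs, and all dimensions are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultiGranularityEncoder(nn.Module):
    """Illustrative sketch (not the paper's code): fuse a global
    low-resolution encoding with fine-grained any-resolution features."""

    def __init__(self, clip_dim=1024, dino_dim=1536, llm_dim=4096):
        super().__init__()
        # Patch-embedding stand-ins for the pretrained encoders; a real
        # system would load CLIP and DINOv2 ViT checkpoints instead.
        self.clip = nn.Conv2d(3, clip_dim, kernel_size=14, stride=14)
        self.dino = nn.Conv2d(3, dino_dim, kernel_size=14, stride=14)
        # Separate projectors map each feature stream into the LLM space.
        self.proj_global = nn.Linear(clip_dim, llm_dim)
        self.proj_local = nn.Linear(clip_dim + dino_dim, llm_dim)

    def forward(self, image, subimages):
        # Global path: downsampled full image -> coarse context tokens.
        g = self.clip(image).flatten(2).transpose(1, 2)   # (B, N, C)
        global_tokens = self.proj_global(g)
        # Local path: high-resolution sub-image crops -> fine-grained
        # tokens, concatenating CLIP and DINOv2 features per patch.
        c = self.clip(subimages).flatten(2).transpose(1, 2)
        d = self.dino(subimages).flatten(2).transpose(1, 2)
        local_tokens = self.proj_local(torch.cat([c, d], dim=-1))
        return global_tokens, local_tokens

# Usage sketch: one downsampled global view plus a 2x2 grid of crops.
enc = MultiGranularityEncoder()
g, l = enc(torch.randn(1, 3, 336, 336), torch.randn(4, 3, 336, 336))
print(g.shape, l.shape)  # (1, 576, 4096) and (4, 576, 4096)
```

One rationale for this kind of split, consistent with the abstract, is that the CLIP-style path supplies semantics aligned with language while the DINOv2 path contributes fine-grained spatial detail for referring and grounding; the exact fusion point and projector design here are assumptions.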
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 45