Poly-Visual-Expert Vision-Language Models

Xiaoran Fan; Tao Ji; 江常皓; Shuo Li; Senjie Jin; Sirui Song; Junke Wang; Boyang Hong; Lu Chen; Guodong Zheng; Ming Zhang; Huangcaishuang; Rui Zheng; Zhiheng Xi; Yuhao Zhou; Shihan Dou; Junjie Ye; Hang Yan; Tao Gui; Qi Zhang; Xipeng Qiu; Xuanjing Huang; Zuxuan Wu; Yu-Gang Jiang

Published: 10 Jul 2024, Last Modified: 26 Aug 2024. Venue: COLM. License: CC BY 4.0

Research Area: LMs on diverse modalities and novel applications

Keywords: Vision-Language Models, Multi-modal Models

TL;DR: Poly-Visual-Expert Vision-Language Models

Abstract: Current large vision-language models (VLMs) face two recurring problems: a single visual component has limited capabilities, and the sequence of visual tokens is excessively long. Both limit the model's ability to interpret complex visual information and long contexts accurately, so addressing them is crucial for improving the performance and applicability of VLMs. This paper proposes an ensemble-of-experts technique that combines the capabilities of individual visual encoders, including those skilled in image-text matching, image segmentation, OCR, etc. The method introduces a fusion network that consolidates the outputs of the different visual experts while bridging the gap between the image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to reduce the positions wasted on lengthy image feature sequences, which addresses position overflow and length limitations. For instance, in our implementation, this technique reduces the positional occupancy of an encoder such as SAM from 4096 to 64, or even to 1. Experimental results show that VLMs with multiple experts consistently outperform those with a single visual encoder, with notable performance improvements as more experts are integrated. Our code is available on our project website.
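The positional-occupancy figures in the abstract (4096, 64, 1) can be illustrated with a small sketch. It is not the authors' implementation. It assumes that image tokens are placed between two text spans and that every `share_group` consecutive image tokens are given the same position id. The function names and the grouping rule are hypothetical and chosen only for illustration.

```python
def image_positions_used(n_image_tokens: int, share_group: int) -> int:
    """Number of distinct position ids consumed by the image tokens."""
    if share_group < 1:
        raise ValueError("share_group must be >= 1")
    # Ceiling division: a partially filled last group still uses one id.
    return -(-n_image_tokens // share_group)


def build_position_ids(n_text_before: int, n_image_tokens: int,
                       n_text_after: int, share_group: int = 1) -> list[int]:
    """Position ids for a [text, image, text] sequence.

    Hypothetical scheme: every `share_group` consecutive image tokens
    share one position id; text tokens get one id each.
    """
    # Validates share_group before it is used as a divisor below.
    used = image_positions_used(n_image_tokens, share_group)
    ids = list(range(n_text_before))
    start = n_text_before
    # Image tokens: token i falls into group i // share_group.
    ids.extend(start + i // share_group for i in range(n_image_tokens))
    # Text after the image resumes right after the last image position.
    start += used
    ids.extend(range(start, start + n_text_after))
    return ids


if __name__ == "__main__":
    for group in (1, 64, 4096):
        ids = build_position_ids(2, 4096, 3, share_group=group)
        print(group, image_positions_used(4096, group), ids[-1])
```

Under these assumptions, a 4096-token image feature sequence occupies 4096 positions with no sharing, 64 positions when groups of 64 tokens share an id, and a single position when all image tokens share one id, while the sequence itself still contains every token.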

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Submission Number: 742
