Poly-Visual-Expert Vision-Language Models

Published: 10 Jul 2024, Last Modified: 26 Aug 2024COLMEveryoneRevisionsBibTeXCC BY 4.0
Research Area: LMs on diverse modalities and novel applications
Keywords: Vision-Language Models, Multi-modal Models
TL;DR: Poly-Visual-Expert Vision-Language Models
Abstract: Current large vision-language models (VLMs) frequently face challenges such as the limited capabilities of a single visual component and the excessive length of visual tokens. These issues can limit the model's ability to interpret complex visual information and over-lengthy contextual information accurately. Tackling these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes leveraging the ensemble experts technique to synergize the capabilities of individual visual encoders, including those skilled in image-text matching, image segmentation, OCR, etc. This method introduces a fusion network that consolidates the outputs from different visual experts while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to mitigate the waste of positional encoding caused by lengthy image feature sequences, effectively addressing the issue of position overflow and length limitations. For instance, in our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient 64 or even down to 1. Experimental results show that VLMs with multiple experts consistently outperform isolated visual encoders, with notable performance improvements as more experts are integrated. Our codes are available on our project website.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 742
Loading