Long-Context Vision Large Language Models: Empirical Insights and A Baseline

Published: 18 Jun 2024, Last Modified: 16 Jul 2024 · LCFM 2024 · CC BY 4.0
Keywords: Vision-language Models, Long-Context, Multimodal
TL;DR: We give empirical insights on long-context vision-language tasks and provide a baseline method.
Abstract: The development of long-context large language models (LLMs) has attracted significant interest. However, progress in long-context vision large language models (VLLMs) lags behind, despite their vast potential in applications such as high-resolution input, multimodal in-context learning, multi-image understanding, and video understanding. In this paper, we present an empirical study that identifies the major challenges in developing long-context VLLMs and propose a simple yet effective baseline for long-context tasks. By captioning the images separately and aggregating the captions as input, we directly alleviate the input-length issue and show that this approach outperforms other context-extension and token-reduction strategies.
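
For readers who want the shape of the baseline in code, the sketch below illustrates the caption-then-aggregate idea under our own assumptions; `caption_image` and `query_llm` are hypothetical placeholders for whichever captioning VLM and LLM one plugs in, not functions or APIs from the paper.

```python
# Minimal sketch of the caption-then-aggregate baseline described in the abstract.
# caption_image and query_llm are hypothetical placeholders, not the paper's code:
# substitute any captioning VLM and any text-only LLM.
from typing import Callable, List


def caption_and_aggregate(
    image_paths: List[str],
    question: str,
    caption_image: Callable[[str], str],  # hypothetical: image path -> short caption
    query_llm: Callable[[str], str],      # hypothetical: text prompt -> answer
) -> str:
    """Caption each image separately, then feed only the captions (text) to the LLM.

    The LLM never sees image tokens, only one compact textual summary per image,
    which sidesteps the long visual-token context.
    """
    captions = [caption_image(p) for p in image_paths]  # one caption per image
    context = "\n".join(f"Image {i + 1}: {c}" for i, c in enumerate(captions))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return query_llm(prompt)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without downloading any models.
    demo_captioner = lambda path: f"a photo stored at {path}"
    demo_llm = lambda prompt: f"(answer based on {prompt.count('Image ')} captions)"
    print(
        caption_and_aggregate(
            ["a.jpg", "b.jpg"], "How many images show animals?",
            demo_captioner, demo_llm,
        )
    )
```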
Submission Number: 11