CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang; Qingsong Lv; Wenmeng Yu; Wenyi Hong; Ji Qi; Yan Wang; Junhui Ji; Zhuoyi Yang; Lei Zhao; Song XiXuan; Jiazheng Xu; Keqin Chen; Bin Xu; Juanzi Li; Yuxiao Dong; Ming Ding; Jie Tang

Published: 25 Sept 2024, Last Modified: 06 Nov 2024

Venue: NeurIPS 2024 poster

License: CC BY 4.0

Keywords: Multimodal Learning, Vision and Language, Representation Learning, Large Language Model

TL;DR: We introduce CogVLM, a powerful open-source visual language foundation model.

Abstract:

We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables a deep fusion of vision-language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 17 classic cross-modal benchmarks, including 1) image captioning datasets: NoCaps, Flickr30k; 2) VQA datasets: OKVQA, TextVQA, OCRVQA, ScienceQA; 3) LVLM benchmarks: MM-Vet, MMBench, SEED-Bench, LLaVABench, POPE, MMMU, MathVista; 4) visual grounding datasets: RefCOCO, RefCOCO+, RefCOCOg, Visual7W. Code and checkpoints are available on GitHub.
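The abstract does not spell out how the visual expert is wired in, so the following is only a minimal sketch of one plausible reading: image tokens pass through a separate set of trainable weights while text tokens keep using the frozen language-model weights. The names `matvec`, `routed_projection`, `W_text`, and `W_image` are hypothetical, and the toy 2x2 matrices stand in for a real attention or FFN projection.

```python
# Illustrative sketch only, not the authors' implementation.
# Assumption: each token is routed by modality to one of two weight sets.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def routed_projection(hidden, is_image, W_text, W_image):
    """Apply W_image to image tokens and W_text to text tokens.

    hidden:   list of token vectors
    is_image: list of booleans, True where the token comes from the image
    W_text:   stands in for the frozen language-model weights
    W_image:  stands in for the trainable visual expert weights
    """
    if len(hidden) != len(is_image):
        raise ValueError("hidden and is_image must have the same length")
    return [
        matvec(W_image if img else W_text, h)
        for h, img in zip(hidden, is_image)
    ]

# Toy weights: identity for text, 2x scaling for the image expert.
W_text = [[1.0, 0.0], [0.0, 1.0]]
W_image = [[2.0, 0.0], [0.0, 2.0]]

# Two image tokens followed by one text token.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_image = [True, True, False]

out = routed_projection(hidden, is_image, W_text, W_image)
print(out)  # [[2.0, 4.0], [6.0, 8.0], [5.0, 6.0]]
```

Under this assumption, a text-only input never touches `W_image`, so its outputs match those of the unmodified language weights, which is consistent with the abstract's claim that NLP performance is preserved.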

Primary Area: Machine vision

Submission Number: 9852
