A Single Transformer for Scalable Vision-Language Modeling

Published: 13 Nov 2024, Last Modified: 13 Nov 2024. Accepted by TMLR. License: CC BY 4.0
Abstract: We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous architectures that connect pre-trained visual encoders with large language models (LLMs) to facilitate visual recognition and complex reasoning. Although they achieve remarkable performance with relatively lightweight training, we identify four primary scalability limitations: (1) The visual capacity is constrained by pre-trained visual encoders, which are typically an order of magnitude smaller than LLMs. (2) The heterogeneous architecture complicates the use of established hardware and software infrastructure. (3) Studying scaling laws for such an architecture must account for three separate components (visual encoder, connector, and LLM), which complicates the analysis. (4) Existing visual encoders typically require a pre-defined specification of image input pre-processing, for example, reshaping inputs to fixed-resolution square images. This inflexibility can create bottlenecks and impede scalability. A unified single-Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs; however, its limited adoption in the modern context likely stems from the absence of reliable training recipes that balance both modalities and ensure stable training for billion-scale models. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM with the single Transformer architecture, using moderate academic resources (8 x A100 80GB GPUs). The training recipe involves initializing from LLMs, sequential pre-training on ImageNet and web-scale data, and instruction fine-tuning on our curated high-quality datasets. In extensive evaluations, SOLO demonstrates performance comparable to LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning.
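To make the "single Transformer" idea concrete, below is a minimal PyTorch sketch of the architecture family the abstract describes: raw image patches are linearly embedded into the same space as text tokens and processed jointly by one Transformer, with no separate pre-trained visual encoder or connector. This is not the authors' implementation; all module names and hyperparameters are illustrative assumptions, and causal masking and positional embeddings are omitted for brevity.

```python
# Minimal sketch (not the official SOLO code) of a unified vision-language
# Transformer: image patches and text tokens share one embedding space and
# one Transformer stack. All names/sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedVLTransformer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4, n_heads=8,
                 patch_size=16, in_channels=3):
        super().__init__()
        self.patch_size = patch_size
        # Raw image patches are linearly projected into the token embedding
        # space -- no pre-trained visual encoder, no connector module.
        self.patch_embed = nn.Linear(in_channels * patch_size * patch_size, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def patchify(self, images):
        # images: (B, C, H, W) -> (B, num_patches, C * P * P)
        B, C, H, W = images.shape
        P = self.patch_size
        patches = images.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
        return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

    def forward(self, images, input_ids):
        img_tokens = self.patch_embed(self.patchify(images))  # (B, N_img, D)
        txt_tokens = self.token_embed(input_ids)              # (B, N_txt, D)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)      # one joint sequence
        hidden = self.blocks(seq)
        return self.lm_head(hidden)                           # next-token logits

# Usage: a 224x224 image yields 196 patch tokens, followed by 16 text tokens.
model = UnifiedVLTransformer()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 212, 32000])
```

Because patches are embedded directly, such a model is not tied to a fixed-resolution square input specification in the way a frozen pre-trained visual encoder typically is; the number of patch tokens simply grows with the image size.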
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/Yangyi-Chen/SOLO
Assigned Action Editor: ~Sanghyuk_Chun1
Submission Number: 3133