Collaborative Training of Tiny-Large Vision Language Models

Published: 20 Jul 2024, Last Modified: 05 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Recently, large vision language models (LVLMs) have advanced AI by integrating visual and linguistic data for tasks such as visual conversation, image captioning, and visual question answering. Current LVLM research either scales up model size for better performance or reduces parameters to fit limited computational resources. We believe both large and tiny models have unique strengths and that collaborative training yields better results than independent training. We propose Collaborative Training of Tiny-Large Vision Language Models (CTVLMs), a framework that connects large and tiny models via a projection layer and leverages a synergistic training strategy. Our framework improves training efficiency by strengthening the interconnection between large and tiny models. Exploiting the parameter efficiency of tiny models, we first align image-text features effectively, then apply knowledge distillation to help the large model better align cross-modal information. During fine-tuning, the large model's extensive knowledge enhances the tiny model's performance. This collaborative approach allows the models to adapt to various computational resources and outperforms existing methods on vision-language tasks.
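The submission does not include code on this page; as an illustrative sketch only, the following PyTorch snippet shows one way a tiny and a large vision-language model could be coupled through a projection layer and trained with bidirectional knowledge distillation, in the spirit of the abstract. All names, dimensions, and the specific loss (temperature-scaled KL distillation) are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch (not the authors' code): coupling a tiny and a large
# vision-language model via a projection layer, with a standard distillation
# loss that can be applied in either direction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CollaborativeVLM(nn.Module):
    """Wraps a tiny and a large VLM; both are assumed to return (features, logits)."""

    def __init__(self, tiny_vlm: nn.Module, large_vlm: nn.Module,
                 tiny_dim: int = 768, large_dim: int = 4096):
        super().__init__()
        self.tiny = tiny_vlm    # parameter-efficient model
        self.large = large_vlm  # knowledge-rich model
        # Projection layer bridging the tiny model's feature space
        # to the large model's feature space.
        self.proj = nn.Linear(tiny_dim, large_dim)

    def forward(self, images, texts):
        tiny_feat, tiny_logits = self.tiny(images, texts)
        large_feat, large_logits = self.large(images, texts)
        # Map tiny-model features into the large model's space.
        bridged_feat = self.proj(tiny_feat)
        return bridged_feat, large_feat, tiny_logits, large_logits


def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 T: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL-divergence knowledge distillation."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# Stage 1 (alignment, per the abstract): the tiny model aligns image-text
# features cheaply; distillation then helps the large model align cross-modal
# information.
# Stage 2 (fine-tuning): the large model's knowledge is distilled back into
# the tiny model to boost its performance.
```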
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: We propose a collaborative framework that significantly improves the alignment of multimodal information between large and tiny models. We also propose a novel collaborative training strategy in which the large model imparts its extensive knowledge to the tiny model, optimizing the learning process.
Supplementary Material: zip
Submission Number: 2264