Collaborative Training of Tiny-Large Vision Language Models

Published: 20 Jul 2024, Last Modified: 05 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Recently, large vision language models (LVLMs) have advanced AI by integrating visual and linguistic data for tasks such as visual conversation, image captioning, and visual question answering. Current LVLM research either scales up model size for better performance or reduces parameters to fit limited computational resources. We believe both large and tiny models have unique strengths and that collaborative training yields better results than independent training. We propose Collaborative Training of Tiny-Large Vision Language Models (CTVLMs), a framework that connects large and tiny models via a projection layer and leverages a synergistic training strategy. Our framework improves training efficiency by strengthening the interconnection between large and tiny models. Exploiting the parameter efficiency of tiny models, we first align image-text features effectively, then apply knowledge distillation to help the large model better align cross-modal information. During fine-tuning, the large model's extensive knowledge enhances the tiny model's performance. This collaborative approach allows the models to adapt to various computational resources and outperforms existing methods on vision-language tasks.
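The submission does not include code on this page; as an illustrative sketch only, the following PyTorch snippet shows one way a tiny and a large vision-language model could be coupled through a projection layer and trained with bidirectional knowledge distillation, in the spirit of the abstract. All names, dimensions, and the specific loss (temperature-scaled KL distillation) are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch (not the authors' code): coupling a tiny and a large
# vision-language model via a projection layer, with a standard distillation
# loss that can be applied in either direction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CollaborativeVLM(nn.Module):
    """Wraps a tiny and a large VLM; both are assumed to return (features, logits)."""

    def __init__(self, tiny_vlm: nn.Module, large_vlm: nn.Module,
                 tiny_dim: int = 768, large_dim: int = 4096):
        super().__init__()
        self.tiny = tiny_vlm    # parameter-efficient model
        self.large = large_vlm  # knowledge-rich model
        # Projection layer bridging the tiny model's feature space
        # to the large model's feature space.
        self.proj = nn.Linear(tiny_dim, large_dim)

    def forward(self, images, texts):
        tiny_feat, tiny_logits = self.tiny(images, texts)
        large_feat, large_logits = self.large(images, texts)
        # Map tiny-model features into the large model's space.
        bridged_feat = self.proj(tiny_feat)
        return bridged_feat, large_feat, tiny_logits, large_logits


def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 T: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL-divergence knowledge distillation."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# Stage 1 (alignment, per the abstract): the tiny model aligns image-text
# features cheaply; distillation then helps the large model align cross-modal
# information.
# Stage 2 (fine-tuning): the large model's knowledge is distilled back into
# the tiny model to boost its performance.
```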
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: We propose a collaborative framework that significantly improves the alignment of multimodal information between large and tiny models. We also propose a novel collaborative training strategy in which the large model imparts its extensive knowledge to the tiny model, optimizing the learning process.
Supplementary Material: zip
Submission Number: 2264