Collaborative Training of Tiny-Large Vision Language Models

Published: 01 Jan 2024 · Last Modified: 08 Apr 2025 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: Recently, large vision language models (LVLMs) have advanced AI by integrating visual and linguistic data for tasks such as visual conversation, image captioning, and visual question answering. Current LVLM research either scales up model size for higher performance or reduces parameters to fit limited computational resources. We believe both large and tiny models have unique strengths, and that collaborative training yields better results than training each independently. We propose Collaborative Training of Tiny-Large Vision Language Models (CTVLMs), a framework that connects large and tiny models via a projection layer and leverages a synergistic training strategy. Our framework improves training efficiency by strengthening the interconnection between the large and tiny models. Exploiting the parameter efficiency of tiny models, we first align image-text features, and then apply knowledge distillation to help the large model better align cross-modal information. During fine-tuning, the large model's extensive knowledge in turn enhances the tiny model's performance. This collaborative approach allows the models to adapt to various computational resources and outperforms existing methods on vision-language tasks.
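
The abstract describes the framework only at a high level. The following is a minimal, hypothetical sketch (assuming PyTorch) of how a projection layer and bidirectional knowledge distillation between a tiny and a large model might be wired together; all module names, dimensions, and the specific KL-based distillation loss are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    # Projection layer connecting the tiny model's feature space to the large model's.
    def __init__(self, tiny_dim: int, large_dim: int):
        super().__init__()
        self.proj = nn.Linear(tiny_dim, large_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # Generic KL-based knowledge distillation; the paper may use a different formulation.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

if __name__ == "__main__":
    batch, tiny_dim, large_dim, vocab = 4, 256, 1024, 32000
    projector = Projector(tiny_dim, large_dim)

    # Stand-ins for model outputs; in practice these come from the tiny and large VLMs.
    tiny_feats = torch.randn(batch, tiny_dim)
    tiny_logits = torch.randn(batch, vocab, requires_grad=True)
    large_logits = torch.randn(batch, vocab, requires_grad=True)

    # The tiny model's visual features, mapped into the large model's space.
    projected = projector(tiny_feats)

    # Alignment stage (illustrative): the cheaply aligned tiny model guides the large model.
    loss_large = distillation_loss(large_logits, tiny_logits.detach())

    # Fine-tuning stage (illustrative): the direction reverses and the large model's
    # outputs supervise the tiny model.
    loss_tiny = distillation_loss(tiny_logits, large_logits.detach())

    print(loss_large.item(), loss_tiny.item(), projected.shape)

In an actual training loop, each distillation term would be combined with the usual image-text alignment and language-modeling objectives; the sketch only isolates the collaborative, two-way distillation idea stated in the abstract.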
