Keywords: multimodal language models, knowledge distillation
TL;DR: Knowledge-guided distillation compresses multimodal LLMs, retaining 87% of teacher performance while speeding inference 1.4× and cutting parameters by 49%, enabling efficient, knowledge-rich reasoning.
Abstract: Contemporary Multimodal Large Language Models (MLLMs) demonstrate exceptional capabilities in synthesizing visual and linguistic information with external knowledge repositories for sophisticated reasoning applications. Nevertheless, their substantial computational requirements present significant obstacles to deployment in resource-constrained settings. This research presents a knowledge-guided distillation methodology that transfers reasoning capabilities from large, knowledge-enriched teacher networks to compact student models. Our technique preserves 87.3\% of the teacher model's performance while achieving a 1.4$\times$ acceleration in inference speed and a 49\% reduction in parameter count. Evaluations on knowledge-enhanced visual question answering datasets demonstrate that our distillation approach surpasses conventional distillation methods by 0.4 percentage points while maintaining comparable factual accuracy. These findings establish a viable pathway for developing efficient MLLMs optimized for knowledge-intensive applications demanding real-time processing capabilities.
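To make the teacher-to-student transfer concrete, the sketch below shows a standard soft-label distillation objective (temperature-scaled KL divergence against the teacher plus cross-entropy on ground-truth labels). This is a generic baseline formulation for illustration only, not the paper's specific knowledge-guided variant; the function name `distillation_loss` and the hyperparameters `T` and `alpha` are illustrative assumptions rather than values from the submission.

```python
# Minimal sketch of a soft-label distillation objective (generic KD,
# not the paper's knowledge-guided method). T and alpha are
# illustrative hyperparameters, not values reported in the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine hard-label cross-entropy with a softened teacher/student KL term."""
    # Soft targets: teacher distribution at temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Standard supervised loss on ground-truth answers.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Example usage with random tensors (batch of 4, answer vocabulary of 10):
if __name__ == "__main__":
    s = torch.randn(4, 10, requires_grad=True)  # student logits
    t = torch.randn(4, 10)                      # teacher logits
    y = torch.randint(0, 10, (4,))              # ground-truth labels
    loss = distillation_loss(s, t, y)
    loss.backward()
    print(float(loss))
```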
Submission Number: 5