Feature-Level Knowledge Distillation from LMM for Enhanced Image Classification

Published: 16 Oct 2025 · Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: multimodal, large multimodal model, CNN, knowledge distillation, LLaVA, ResNet, CLIP
TL;DR: We propose a knowledge distillation framework that transfers LMM-generated text embeddings into ResNet-50, enabling richer visual representations and improving classification accuracy without requiring LMMs at inference.
Abstract: Large multimodal models (LMMs) leverage vast numbers of parameters and large-scale training data to acquire extensive knowledge and exhibit strong reasoning capabilities. Despite this generality, however, they often fail to surpass the performance of vision models specialized for traditional vision-centric tasks. Although recent efforts have produced smaller language models, these remain insufficient for visual reasoning in environments constrained by memory and communication resources. In this study, we investigate transferring prior knowledge from LMMs into vision models and observe notable performance improvements. Our experiments highlight the role of LMM-generated text in enhancing vision model training, providing new insights into improving vision models through multimodal knowledge transfer.
Submission Number: 98