Multimodal Commonsense Knowledge Distillation for Visual Question Answering

Published: 31 Mar 2025, Last Modified: 16 Apr 2025 · AAAI 2025 · CC BY 4.0
Abstract: Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performance in general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge, due to the challenges in generating high-quality prompts and the high computational cost of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects, and questions through a Graph Convolutional Network (GCN) in a teacher-student setting. The proposed framework is flexible with any type of teacher and student model without further fine-tuning, and achieves competitive performance on the ScienceQA dataset. The code is available at https://github.com/adlnlp/MCKDVQA.
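
To make the abstract's high-level description concrete, below is a minimal sketch (not the authors' released code) of the general pattern it describes: a lightweight GCN student operating on a unified graph whose nodes represent the question, detected visual objects, and retrieved commonsense facts, trained against the answer logits of a frozen teacher via knowledge distillation. All class names, dimensions, the pooling scheme, and the loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming a PyTorch GCN student distilled from a frozen teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNLayer(nn.Module):
    """One GCN layer: H' = ReLU(A_hat @ H @ W), with A_hat a normalized adjacency."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        return F.relu(adj_norm @ self.linear(x))


class GCNStudent(nn.Module):
    """Student: two GCN layers, mean-pool over nodes, then an answer classifier."""

    def __init__(self, in_dim, hid_dim, num_answers):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(in_dim, hid_dim)
        self.gcn2 = SimpleGCNLayer(hid_dim, hid_dim)
        self.classifier = nn.Linear(hid_dim, num_answers)

    def forward(self, node_feats, adj_norm):
        h = self.gcn2(self.gcn1(node_feats, adj_norm), adj_norm)
        return self.classifier(h.mean(dim=0))  # graph-level answer logits


def normalize_adj(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, standard for GCNs."""
    adj = adj + torch.eye(adj.size(0))
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    """Blend hard-label cross-entropy with temperature-scaled KL to the teacher."""
    ce = F.cross_entropy(student_logits.unsqueeze(0), label.unsqueeze(0))
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1).unsqueeze(0),
        F.softmax(teacher_logits / T, dim=-1).unsqueeze(0),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl


# Toy usage: 6 nodes (question, objects, commonsense facts) with 128-d features.
node_feats = torch.randn(6, 128)
adj_norm = normalize_adj(torch.randint(0, 2, (6, 6)).float())
student = GCNStudent(128, 256, num_answers=4)
teacher_logits = torch.randn(4)          # from any frozen MLLM/VLPM teacher
label = torch.tensor(1)
loss = distillation_loss(student(node_feats, adj_norm), teacher_logits, label)
loss.backward()
```

Because the teacher only supplies soft answer targets, this pattern leaves both the teacher and student backbones unchanged, which is consistent with the abstract's claim of working with any teacher/student pair without further fine-tuning.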