From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Grounded Situation Recognition
Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit strong
zero-shot abilities but struggle with complex Grounded Situation
Recognition (GSR) and are too resource-intensive for deployment on edge devices. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare
situations. In this paper, we explore transferring knowledge from
a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve
this, we propose Multimodal Interactive Prompt Distillation (MIPD),
a novel framework that distills enriched multimodal knowledge
from the foundation model, enabling the student Ov-GSR model to
recognize unseen situations and improve its awareness of rare situations.
Specifically, the MIPD framework first leverages the LLM-based
Judgmental Rationales Generator (JRG) to construct positive and
negative glimpse and gaze rationales enriched with contextual
semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with
visual information from the MLLM teacher via the Negative-Guided
Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the
aligned multimodal knowledge is distilled into the student Ov-GSR
model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and
unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further
demonstrate improved detection of unseen cases on the HICO-DET dataset.