Abstract: Recent advances in multimodal models have shown promise, yet their dependence on having the same modalities available at training and inference limits their application. Existing methods mitigate this problem by reconstructing the missing modalities, but they incur additional computational cost, which can be just as critical, especially for large, deployed systems. To address these issues, we propose a novel multimodal guidance network that promotes knowledge sharing during training, leveraging multimodal representations to train better single-modality models for inference. Real-world experiments on violence detection show that single-modality models trained with our framework significantly outperform their traditionally trained counterparts while maintaining the same inference cost. Code will be made public upon acceptance.
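The abstract does not specify the training objective, so the following is only a minimal sketch of one plausible reading: a single-modality model whose representation is pulled toward a fused multimodal "guidance" representation during training, while only the single-modality branch is kept at inference. All module names, dimensions, the MSE guidance loss, and the weighting `alpha` are hypothetical, not the paper's actual design.

```python
# Illustrative sketch only; architecture and loss are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnimodalEncoder(nn.Module):
    """Encodes one modality (e.g., video features) into a shared embedding space."""
    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim)
        )

    def forward(self, x):
        return self.net(x)


class MultimodalGuidance(nn.Module):
    """Fuses two modalities into a single representation used only at training time."""
    def __init__(self, dim_a: int, dim_b: int, hid_dim: int = 128):
        super().__init__()
        self.enc_a = UnimodalEncoder(dim_a, hid_dim)
        self.enc_b = UnimodalEncoder(dim_b, hid_dim)
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, a, b):
        return self.fuse(torch.cat([self.enc_a(a), self.enc_b(b)], dim=-1))


def training_step(student, guidance, head, a, b, labels, alpha=0.5):
    """Task loss on the single-modality branch plus a guidance term that pulls the
    student's representation toward the fused multimodal one (assumed objective)."""
    z_student = student(a)            # single-modality branch, kept at inference
    with torch.no_grad():
        z_fused = guidance(a, b)      # multimodal guidance, assumed trained separately
    task_loss = F.cross_entropy(head(z_student), labels)
    guide_loss = F.mse_loss(z_student, z_fused)
    return task_loss + alpha * guide_loss


# Toy usage with random tensors standing in for two modalities (e.g., video and audio).
student = UnimodalEncoder(in_dim=64)
guidance = MultimodalGuidance(dim_a=64, dim_b=32)
head = nn.Linear(128, 2)              # binary violence / non-violence classifier
a, b = torch.randn(8, 64), torch.randn(8, 32)
labels = torch.randint(0, 2, (8,))
loss = training_step(student, guidance, head, a, b, labels)
loss.backward()
```

At inference, only `student` and `head` would be needed, which is consistent with the abstract's claim of unchanged inference cost; how the guidance network itself is trained is not described in this abstract.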
Paper Type: short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment
Languages Studied: English