Abstract: Multimodal recognition can achieve better performance by leveraging complementary information from different modalities. In real-world scenarios, however, multimodal samples often express discordant semantic meanings across modalities and lack evident complementary information. Humans can easily understand the intrinsic semantics of such semantically discordant samples, but existing multimodal recognition models perform poorly on them. To improve the robustness of multimodal recognition models in practical scenarios, this work poses a new challenge in multimodal recognition, termed Semantic Discordance Understanding. Unlike existing works, which focus only on detecting semantically discordant samples as noisy data, this challenge requires deep models to match humans' ability to understand the inherent semantic meaning of semantically discordant samples. To address this challenge, we propose the Progressive Multimodal Pivot Learning (PMPL) approach, which introduces a learnable pivot memory to explore the inherent semantic meaning hidden under discordant modalities. Specifically, our approach inserts Pivot Memory Learning (PML) modules into multiple layers of unimodal foundation models to progressively trade off conflicting information across modalities. By introducing the multimodal pivot learning paradigm for multimodal recognition, PMPL alleviates the negative effect of semantic discordance caused by the cross-modal information exchange mechanism of existing multimodal recognition models. Experiments on different benchmarks validate the superiority of our approach. Code is available at https://github.com/tiggers23/PMPL.