Abstract: Highlights
• We propose a unified multimodal framework to streamline multimodal classification.
• Image description knowledge is generated to enrich the multimodal semantic space.
• A contrastive learning scheme is designed to capture underlying semantics and cross-modal interaction.
• A multimodal dual perception module is developed to model congruity and incongruity.
• Experimental results and analyses demonstrate the superiority of our model.