Abstract: Highlights
• A unified multimodal classification framework that can handle various multimodal classification tasks.
• Flexibly process data from multiple modalities, including images, texts, audio, and videos.
• Metric-based triplet learning to extract intra-modal relationships in every modality.
• Contrastive pairwise learning to capture inter-modal relationships across multiple modalities.
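The highlights name two learning objectives without giving their formulas. As a rough illustration only, the sketch below uses the textbook margin-based triplet loss (for relationships within one modality) and the classic pairwise contrastive loss (for matched or mismatched embeddings from two modalities). The function names, the Euclidean distance, and the margin values are assumptions made for this example and are not taken from the paper, whose actual losses may differ.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Textbook triplet loss (assumed form, not the paper's).

    All three embeddings come from the same modality: the anchor should be
    closer to the same-class positive than to the other-class negative by
    at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def contrastive_pair_loss(x, y, is_match, margin=1.0):
    """Classic pairwise contrastive loss (assumed form, not the paper's).

    `x` and `y` are embeddings from two different modalities, e.g. an image
    and a text. Matching pairs are pulled together; mismatched pairs are
    pushed apart until they are at least `margin` away from each other.
    """
    d = euclidean(x, y)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    # Intra-modal example: three hypothetical image embeddings.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]))
    # Inter-modal example: a hypothetical image/text pair that does not match.
    print(contrastive_pair_loss([0.0, 0.0], [0.5, 0.0], is_match=False))
```

In a full model these per-sample terms would typically be averaged over a batch and combined with a classification loss; how the paper weights them is not stated in the highlights.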