Abstract: Recent multimodal frameworks often grapple with semantic misalignment and noise, which impede the effective integration of diverse modalities. To address this problem, we present CaMN (Cross-aligned Multimodal Network), a framework designed to enhance multimodal understanding through a robust cross-alignment mechanism. Unlike conventional fusion methods, our framework aligns features extracted from images, text, and graphs via a tailored loss function, enabling seamless integration and exploitation of complementary information. Leveraging Abstract Meaning Representation (AMR), we extract intricate semantic structures from textual data, enriching the multimodal representation with contextual depth. To further enhance robustness, we employ a masked autoencoder to learn a noise-independent feature space. Through comprehensive evaluation on the CrisisMMD dataset, CaMN demonstrates superior performance on crisis event classification tasks, highlighting its potential to advance multimodal understanding across diverse domains. Our code is available at https://github.com/brillard1/CaMN.