Keywords: multimodal ML; Multimodal Learning; Adversarial Learning
TL;DR: This paper proposes a model-free, category-aware dynamic fusion framework built on a fully data-driven approach.
Abstract: Large-scale multimodal transformers excel at cross-modal reasoning but incur prohibitive computational costs and lack theoretical grounding. We propose **DEF+AAF**, combining *Discriminative Embedding (DEF)* with *Adversarial Alignment (AAF)* to achieve provably robust multimodal fusion. We prove that class-conditional variance contraction combined with Wasserstein barycenter alignment yields a tighter generalization bound (**Theorem 3**) than standard contrastive methods, reducing expected error by $O(\sqrt{M/N})$, where $M$ is the number of modalities and $N$ the number of training samples. On emotion recognition (IEMOCAP, MOSEI) and translation (Multi30k, How2), DEF+AAF matches transformer baselines with 2.4× fewer parameters and 1.6× lower FLOPs, and gains +8.4% robustness under 50% missing modalities.
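The abstract combines two terms: a class-conditional variance contraction (DEF) and a distribution-alignment term across modalities (AAF). The sketch below is a minimal, hypothetical PyTorch rendering of such an objective, not the paper's implementation: the function names (`def_aaf_loss`, `variance_contraction`, `sliced_wasserstein`) and the weight `lambda_align` are illustrative, and the sliced-Wasserstein distance is used only as a tractable stand-in for the Wasserstein barycenter alignment described above.

```python
# Hypothetical sketch of a DEF+AAF-style objective; names and choices are assumptions.
import torch


def variance_contraction(z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """DEF-style term: shrink the class-conditional variance of fused embeddings z [B, D]."""
    loss = z.new_zeros(())
    for c in labels.unique():
        zc = z[labels == c]
        if zc.shape[0] > 1:
            loss = loss + ((zc - zc.mean(dim=0)) ** 2).sum(dim=1).mean()
    return loss


def sliced_wasserstein(a: torch.Tensor, b: torch.Tensor, n_projections: int = 64) -> torch.Tensor:
    """Alignment proxy: sliced-Wasserstein distance between two embedding batches [B, D]."""
    proj = torch.randn(a.shape[1], n_projections, device=a.device)
    proj = proj / proj.norm(dim=0, keepdim=True)          # random unit directions
    pa, _ = (a @ proj).sort(dim=0)                        # 1-D projections, sorted
    pb, _ = (b @ proj).sort(dim=0)
    return ((pa - pb) ** 2).mean()


def def_aaf_loss(mod_embeddings: list[torch.Tensor], labels: torch.Tensor,
                 lambda_align: float = 0.1) -> torch.Tensor:
    """Combined objective: contract class variance of the fused embedding and
    align each modality to the shared (barycenter-like) mean embedding."""
    fused = torch.stack(mod_embeddings).mean(dim=0)
    loss = variance_contraction(fused, labels)
    for z in mod_embeddings:
        loss = loss + lambda_align * sliced_wasserstein(z, fused)
    return loss
```

Usage would follow the standard pattern: compute per-modality embeddings with any lightweight encoders, pass them with the batch labels to `def_aaf_loss`, and backpropagate alongside the task loss.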
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 18576