Class-Conditional Autoencoders with Adversarial Alignment for Multimodal Fusion

ICLR 2026 Conference Submission 18576 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Learning; Adversarial Learning
TL;DR: This paper proposes a lightweight, category-aware dynamic fusion framework for multimodal learning that pairs class-conditional autoencoding with adversarial alignment.
Abstract: Multimodal learning has advanced rapidly with large-scale Transformers, but these models often demand heavy computation and lack clear theoretical grounding. We propose a lightweight yet robust framework for multimodal fusion that unifies efficiency with theoretical guarantees. At its backbone lies a Class-Conditional Autoencoder (CCAE), which maps modality-specific inputs into a class-aware latent space. Building on this, our Discriminative Embedding Framework (DEF) combines homologous and reconstruction losses to contract intra-class variance while preserving semantic fidelity, producing embeddings that are both compact and discriminative. To address distributional inconsistencies across modalities, we introduce the Adversarial Alignment Framework (AAF), which dynamically weights modality contributions and aligns fused embeddings with modality-specific distributions via a Wasserstein objective. Together, DEF and AAF form a cohesive framework that explains why consistency and alignment emerge from a unified optimization perspective. Extensive experiments on machine translation (How2, Multi30k) and emotion recognition (IEMOCAP, MOSEI) show that our approach consistently outperforms strong baselines, including Transformer, MulT, and MISA, while requiring far fewer FLOPs.
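To make the three components concrete, below is a minimal PyTorch sketch of the pipeline as described in the abstract. All module internals, layer sizes, and loss forms here are illustrative assumptions inferred from the text (the class-conditioning scheme, the exact homologous loss, and the critic architecture are not specified in the abstract), not the authors' implementation.

# Minimal sketch of CCAE + DEF losses + AAF fusion, assuming a simple
# label-embedding conditioning scheme and a softmax modality gate.
# All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCAE(nn.Module):
    """Class-Conditional Autoencoder: encodes one modality into a
    class-aware latent space by concatenating a label embedding."""
    def __init__(self, in_dim, latent_dim, num_classes):
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, latent_dim)
        self.encoder = nn.Sequential(
            nn.Linear(in_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim))

    def forward(self, x, y):
        c = self.label_emb(y)
        z = self.encoder(torch.cat([x, c], dim=-1))
        x_hat = self.decoder(torch.cat([z, c], dim=-1))
        return z, x_hat

def def_losses(z, x, x_hat, y):
    """DEF objective: a reconstruction term preserving semantic fidelity
    plus a homologous term contracting embeddings toward their class mean
    (one plausible reading of 'contract intra-class variance')."""
    recon = F.mse_loss(x_hat, x)
    homologous = 0.0
    classes = y.unique()
    for cls in classes:
        zc = z[y == cls]
        homologous = homologous + (zc - zc.mean(0)).pow(2).sum(-1).mean()
    return recon, homologous / classes.numel()

class AAF(nn.Module):
    """Adversarial Alignment Framework: dynamically weights per-modality
    embeddings, then scores fused vs. modality-specific embeddings with
    a critic, giving a Wasserstein-style alignment objective."""
    def __init__(self, latent_dim, num_modalities):
        super().__init__()
        self.gate = nn.Linear(latent_dim * num_modalities, num_modalities)
        self.critic = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, zs):  # zs: list of (batch, latent_dim) embeddings
        w = torch.softmax(self.gate(torch.cat(zs, dim=-1)), dim=-1)
        fused = sum(w[:, i:i + 1] * z for i, z in enumerate(zs))
        # Critic score gap between each modality distribution and the
        # fused distribution; a full WGAN-style setup would also enforce
        # a Lipschitz constraint (weight clipping or gradient penalty).
        align = sum(self.critic(z).mean() - self.critic(fused).mean()
                    for z in zs)
        return fused, align

In training, one would presumably alternate (or jointly weight) the DEF terms against the AAF alignment term, with the critic updated adversarially against the encoders; the abstract does not state the schedule, so this sketch leaves the optimization loop out.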
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 18576