Keywords: Learning Theory, Multimodal Learning, Loss Landscape Smoothness, Robustness
TL;DR: We propose a theoretical framework showing that multimodal learning not only outperforms unimodal approaches but also reaches flatter minima, leading to better generalization.
Abstract: Recent advances have consistently highlighted the superiority of multimodal learning over unimodal approaches across a variety of tasks.
However, the theoretical foundations of this advantage remain underexplored: existing analyses often rest on restrictive assumptions and lack empirical validation.
In this paper, we bridge this gap by proposing a novel theoretical framework grounded in convolutional smoothing, offering a new perspective on how multimodal learning contributes to a smoother loss landscape compared to unimodal learning.
Building upon this theoretical foundation, we introduce a simple yet effective distributional training strategy based on stochastic modality pairing instead of a fixed pairing, thereby further promoting a flatter landscape via convolutional smoothing.
Our empirical results across various multimodal datasets demonstrate that multimodal models not only achieve higher performance but also exhibit flatter loss landscapes, which indicate better generalization and robustness.
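The abstract does not specify how stochastic modality pairing is implemented; the sketch below is one plausible interpretation, not the authors' method. It assumes two modality feature batches with a fixed one-to-one correspondence, and with some probability re-pairs each modality-A sample with a randomly drawn modality-B sample from the batch, injecting pairing noise that could act as the convolutional-smoothing perturbation the abstract describes. The function name and `p_swap` parameter are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_modality_pairing(features_a, features_b, p_swap=0.5):
    """Hypothetical sketch of stochastic modality pairing.

    With probability `p_swap`, each modality-A sample is paired with a
    randomly chosen modality-B sample from the same batch instead of
    its fixed counterpart. The randomized pairing perturbs the inputs
    seen by the fusion model, which (in the spirit of convolutional
    smoothing) can be viewed as averaging the loss over a pairing
    distribution rather than evaluating it at one fixed pairing.
    """
    n = features_a.shape[0]
    idx = np.arange(n)                       # default: fixed pairing i -> i
    swap = rng.random(n) < p_swap            # which pairs to randomize
    idx[swap] = rng.integers(0, n, size=swap.sum())  # random partners
    return features_a, features_b[idx]
```

In an actual training loop, this re-pairing would be applied per mini-batch before fusion, so that each step samples a different pairing from the distribution rather than always using the fixed one.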
Primary Area: learning theory
Submission Number: 12001