Keywords: Multimodal Learning, Modality Imbalance, Classifier Freezing, Alternating Training
TL;DR: We propose CCAT, a framework that freezes an unbiased pre-trained classifier and trains modality-specific LoRA adapters during alternating training to prevent modality imbalance, significantly outperforming state-of-the-art methods.
Abstract: Modality imbalance, driven by divergent convergence dynamics across modalities, critically limits multimodal model performance. Although alternating training methods mitigate encoder-level interference, they fail to prevent the classifier from being dominated by faster-converging modalities, which suppresses contributions from weaker ones. To address this core limitation, we propose Classifier-Constrained Alternating Training (CCAT). Our framework first pre-trains an unbiased cross-modal classifier using bidirectional cross-attention and a regularization term that constrains differences in modality contributions. This classifier is then frozen as a stable decision anchor during subsequent training, preventing bias toward any single modality. To preserve modality-specific features while leveraging this anchor, we integrate modality-specific Low-Rank Adaptation (LoRA) modules into the classifier. During alternating training, CCAT updates only the encoder of the active modality and its corresponding LoRA parameters. Furthermore, a sample-level imbalance detection mechanism quantifies contribution disparities, enabling targeted optimization of severely imbalanced samples to strengthen weaker modalities. Extensive experiments across multiple benchmarks demonstrate CCAT's consistent superiority: it achieves accuracy gains of +1.35% on CREMA-D, +6.76% on Kinetics-Sounds, and +1.92% on MVSA over state-of-the-art methods, validating the framework's efficacy in learning balanced, robust multimodal representations.
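To make the core mechanism concrete, below is a minimal PyTorch-style sketch of the alternating update the abstract describes: a shared classifier head is kept frozen as the decision anchor, each modality gets its own low-rank (LoRA) adapter on that head, and every training step updates only the active modality's encoder and its LoRA parameters. All names (LoRALinear, the toy encoders, dimensions, and hyperparameters) are hypothetical illustrations under these assumptions, not the authors' implementation; the cross-attention pre-training, contribution regularizer, and sample-level imbalance detection are omitted.

```python
# Sketch: frozen shared classifier + per-modality LoRA adapters with alternating updates.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pre-trained classifier stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen anchor prediction + modality-specific low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

num_classes, feat_dim = 10, 256
# Hypothetical toy encoders; in practice these would be audio/video backbones.
encoders = {
    "audio": nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU()),
    "video": nn.Sequential(nn.Linear(512, feat_dim), nn.ReLU()),
}
# One pre-trained classifier head shared by both modalities (frozen anchor).
frozen_head = nn.Linear(feat_dim, num_classes)
heads = {m: LoRALinear(frozen_head) for m in encoders}

criterion = nn.CrossEntropyLoss()
optimizers = {
    m: torch.optim.Adam(
        list(encoders[m].parameters()) + [heads[m].A, heads[m].B], lr=1e-4
    )
    for m in encoders
}

def alternating_step(batch, labels, modality):
    """Update only the active modality's encoder and its LoRA parameters."""
    opt = optimizers[modality]
    opt.zero_grad()
    logits = heads[modality](encoders[modality](batch))
    loss = criterion(logits, labels)
    loss.backward()
    opt.step()
    return loss.item()

# Example: alternate between modalities across steps (dummy data).
for step in range(4):
    modality = "audio" if step % 2 == 0 else "video"
    x = torch.randn(8, 128 if modality == "audio" else 512)
    y = torch.randint(0, num_classes, (8,))
    alternating_step(x, y, modality)
```

Because both adapters wrap the same frozen head, neither modality can pull the shared decision boundary toward itself; only the small rank-limited corrections differ per modality, which is the property the abstract attributes to the frozen anchor.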
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18361