Balanced Multimodal Learning: An Integrated Framework for Multi-Task Learning in Audio-Visual Fusion

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: modal imbalance; audio-visual fusion; multi-task
Abstract: Multimodal learning integrates sensory information from multiple perspectives, offering significant advantages in fields such as sentiment analysis. However, recent studies have highlighted challenges arising from imbalanced contributions and varying convergence rates across modalities. Neglecting these imbalances in joint-learning models compromises both information utilization and overall performance. We further find that neither advanced semantic representations nor complex deep networks effectively address these imbalances. To examine these challenges empirically, we adopt an audio-visual multi-task perspective, focusing on two tasks, lip reading and sentiment analysis, and exploring the contributions of different modalities under varying scenarios. We introduce $\textit{BalanceMLA}$, a multimodal learning framework designed to dynamically balance and optimize each modality: it can adjust the objective of each modality independently and adaptively control its optimization. In addition, we propose a bilateral residual feature fusion strategy and an adaptive weighted decision fusion strategy to dynamically manage these imbalances, together with a dynamically generated class-level weighting scheme for fine-grained tasks. Extensive experimental results validate the superiority of our model in addressing modality imbalance, demonstrating both its effectiveness and versatility. Furthermore, experiments under extreme noise conditions show that our model maintains high fusion efficiency and robustness even in challenging environments.
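
The abstract does not specify how the adaptive weighted decision fusion computes its per-modality weights, so the following is only a minimal PyTorch sketch of the general idea: audio and visual branches each produce logits, per-sample fusion weights are derived here from prediction confidence (an assumption, not the authors' stated rule), and the final decision is the weighted combination of the two logit streams.

```python
# Illustrative sketch only: the confidence-based weighting is an assumption,
# not the BalanceMLA rule described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveDecisionFusion(nn.Module):
    """Combine audio and visual logits with per-sample adaptive weights."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.audio_head = nn.Linear(feat_dim, num_classes)   # audio-only classifier
        self.visual_head = nn.Linear(feat_dim, num_classes)  # visual-only classifier

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor):
        logits_a = self.audio_head(audio_feat)
        logits_v = self.visual_head(visual_feat)

        # Assumed weighting rule: use each modality's prediction confidence
        # (max softmax probability) as an unnormalized weight, then normalize
        # across the two modalities.
        conf_a = F.softmax(logits_a, dim=-1).max(dim=-1, keepdim=True).values
        conf_v = F.softmax(logits_v, dim=-1).max(dim=-1, keepdim=True).values
        weights = torch.cat([conf_a, conf_v], dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        # Weighted decision-level fusion of the two logit streams.
        fused = weights[:, :1] * logits_a + weights[:, 1:] * logits_v
        return fused, logits_a, logits_v


if __name__ == "__main__":
    fusion = AdaptiveDecisionFusion(feat_dim=256, num_classes=7)
    a = torch.randn(4, 256)   # dummy audio features
    v = torch.randn(4, 256)   # dummy visual features
    fused_logits, _, _ = fusion(a, v)
    print(fused_logits.shape)  # torch.Size([4, 7])
```

Because the weights are recomputed per sample, the fusion can lean on whichever modality is currently more reliable, which is one plausible way to realize the dynamic balancing the abstract describes.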
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3598