Sparse MoE as a New Treatment: Addressing Forgetting, Fitting, Learning Issues in Multi-Modal Multi-Task Learning

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: multi-modal learning, multi-task learning, sparse mixture-of-experts
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a systematic approach addressing the forgetting, fitting, and learning issues in multi-modal multi-task learning via Sparse MoE.
Abstract: Sparse Mixture-of-Experts (SMoE) is a promising paradigm that can be easily tailored for multi-task learning. Its conditional computation nature allows us to organically allocate relevant parts of a model for performant and efficient predictions. However, several under-explored pain points persist, especially in scenarios with both multiple modalities and tasks: ($1$) $\textit{Modality Forgetting Issue.}$ Diverse modalities may prefer conflicting optimization directions, resulting in ineffective learning or knowledge forgetting; ($2$) $\textit{Modality Fitting Issue.}$ Current SMoE pipelines select a fixed number of experts for all modalities, which can over-fit simpler modalities or under-fit complex ones; ($3$) $\textit{Heterogeneous Learning Pace.}$ Varied modality attributes, task resources ($\textit{i.e.,}$ the number of input samples), and task objectives usually lead to distinct optimization difficulties and convergence rates. Given these issues, there is a clear need for a systematic approach to harmonizing multi-modal and multi-task objectives when using SMoE. We address these pain points and propose a new $\underline{S}$parse $\underline{M}$oE framework for $\underline{M}$ulti-$\underline{M}$odal $\underline{M}$ulti-task learning, $\textit{a.k.a.}$, $\texttt{SM$^4$}$, which ($1$) disentangles the model spaces of different modalities to mitigate their optimization conflicts; ($2$) automatically determines the modality-specific model size ($\textit{i.e.}$, the number of experts) to improve fitting; and ($3$) synchronizes the learning paces of disparate modalities and tasks based on SMoE training dynamics such as the entropy of routing decisions. Comprehensive experiments validate the effectiveness of $\texttt{SM$^4$}$, which outperforms the previous state-of-the-art across $3$ task groups and $11$ different modalities with a clear performance margin ($\textit{e.g.}$, $\ge 1.37\%$) and a substantial computation reduction ($46.49\% \sim 98.62\%$). Code is included in the supplement.
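To make the three ingredients above concrete, here is a minimal, illustrative PyTorch sketch (not the authors' implementation, which is provided in the supplement). It pairs each modality with its own expert pool and router (disentangled model spaces), lets the number of experts differ per modality, and reports the entropy of the routing decisions that $\texttt{SM$^4$}$ uses as a training-dynamics signal for pacing. Names such as `ModalitySMoELayer` and `experts_per_modality` are hypothetical.

```python
# Illustrative sketch of a modality-disentangled Sparse MoE layer.
# Assumptions (not from the paper): class/argument names, the FFN expert
# design, and top-k token routing; only the three abstract-level ideas
# (per-modality experts, per-modality expert counts, routing entropy) are
# taken from the submission.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalitySMoELayer(nn.Module):
    def __init__(self, d_model, experts_per_modality, top_k=2):
        super().__init__()
        self.top_k = top_k
        # A separate expert pool and router per modality, so gradients from
        # one modality never touch another modality's experts.
        self.experts = nn.ModuleDict({
            m: nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model),
                              nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n)
            ])
            for m, n in experts_per_modality.items()
        })
        self.routers = nn.ModuleDict({
            m: nn.Linear(d_model, n) for m, n in experts_per_modality.items()
        })

    def forward(self, x, modality):
        # x: (batch, tokens, d_model); route each token to its top-k experts.
        logits = self.routers[modality](x)                  # (B, T, n_experts)
        probs = F.softmax(logits, dim=-1)
        topk_p, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize gates

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts[modality]):
                mask = (topk_idx[..., slot] == e)
                if mask.any():
                    out[mask] += topk_p[..., slot][mask].unsqueeze(-1) * expert(x[mask])

        # Mean entropy of routing decisions: low entropy suggests confident,
        # near-converged routing; high entropy suggests the modality/task is
        # still catching up, a signal for synchronizing learning paces.
        routing_entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        return out, routing_entropy


# Usage: a simpler modality gets fewer experts than a more complex one.
layer = ModalitySMoELayer(d_model=256,
                          experts_per_modality={"audio": 4, "video": 8},
                          top_k=2)
tokens = torch.randn(2, 16, 256)
y, h = layer(tokens, modality="video")
print(y.shape, h.item())
```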
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5688