M$^3$SAT: A Sparsely Activated Transformer for Efficient Multi-Task Learning from Multiple Modalities

Jie Peng; Tianlong Chen; Ruida Zhou; Jianmin Ji; Yanyong Zhang; Zhangyang Wang

M$^3$SAT: A Sparsely Activated Transformer for Efficient Multi-Task Learning from Multiple Modalities

Jie Peng, Tianlong Chen, Ruida Zhou, Jianmin Ji, Yanyong Zhang, Zhangyang Wang

Published: 01 Feb 2023, Last Modified: 13 Feb 2023Submitted to ICLR 2023Readers: Everyone

Keywords: multi-task learning, multimodal learning, transformer, mixture of experts

TL;DR: Adapt the mixture-of-experts (MoEs) into both the self-attention and the feed-forward networks (FFN) of a transformer backbone for efficient multi-task learning from multiple modalities.

Abstract: Multi-modal multi-task learning (M$^2$TL) aims to discover the implicit correspondences among heterogeneous modalities and tasks, which is common in real-world applications like autonomous driving and robotics control. Current single-model solutions for M$^2$TL usually fall short in several aspects. The shared backbone between the modalities is prone to overfitting the simpler modality, while jointly optimizing the tasks suffers from unstable training due to the gradient conflicts across tasks. On the other hand, designing a separate model for each task and modality can avoid the above problems but leads to prohibitively expensive computation and memory consumption, rendering this approach unrealistic. In this work, we propose M$^3$SAT, a sparsely activated transformer for efficient M$^2$TL. The proposed framework tailors the mixture-of-experts (MoEs) into both the self-attention and the feed-forward networks (FFN) of a transformer backbone. It adopts the routing policy to assign attention-heads and FFN experts during training, which effectively disentangles the parameter space to prevent training conflicts among diverse modalities and tasks. Meanwhile, disentangled parameter space also restrains the problem of simple modal prone to overfitting. Sparsely activating the transformer also enables efficient computation for each input sample. Through comprehensive evaluation, we demonstrate the effectiveness of our M$^3$SAT: a remarkable performance margin (\textit{e.g.}, $\ge 1.37\%$) is achieved over the dense models with the same computation cost. More importantly, M$^3$SAT can achieve the above performance improvements with a fraction of the computation cost -- our computation is only $1.38\% \sim 53.51\%$ of that of the SOTA methods. Our code will be released upon acceptance.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)

15 Replies

Loading