## M$^3$SAT: A Sparsely Activated Transformer for Efficient Multi-Task Learning from Multiple Modalities

Abstract: Multi-modal multi-task learning (M$^2$TL) aims to discover the implicit correspondences among heterogeneous modalities and tasks, which is common in real-world applications like autonomous driving and robotics control. Current single-model solutions for M$^2$TL usually fall short in several aspects. The shared backbone between the modalities is prone to overfitting the simpler modality, while jointly optimizing the tasks suffers from unstable training due to the gradient conflicts across tasks. On the other hand, designing a separate model for each task and modality can avoid the above problems but leads to prohibitively expensive computation and memory consumption, rendering this approach unrealistic. In this work, we propose M$^3$SAT, a sparsely activated transformer for efficient M$^2$TL. The proposed framework tailors the mixture-of-experts (MoEs) into both the self-attention and the feed-forward networks (FFN) of a transformer backbone. It adopts the routing policy to assign attention-heads and FFN experts during training, which effectively disentangles the parameter space to prevent training conflicts among diverse modalities and tasks. Meanwhile, disentangled parameter space also restrains the problem of simple modal prone to overfitting. Sparsely activating the transformer also enables efficient computation for each input sample. Through comprehensive evaluation, we demonstrate the effectiveness of our M$^3$SAT: a remarkable performance margin (\textit{e.g.}, $\ge 1.37\%$) is achieved over the dense models with the same computation cost. More importantly, M$^3$SAT can achieve the above performance improvements with a fraction of the computation cost -- our computation is only $1.38\% \sim 53.51\%$ of that of the SOTA methods. Our code will be released upon acceptance.