M2SN: Adaptive and Dynamic Multi-modal Shortcut Network Architecture for Latency-Aware Applications

Yifei Pu, Chi Wang, Xiaofeng Hou, Cheng Xu, Jiacheng Liu, Jing Wang, Minyi Guo, Chao Li

Published: 2024, Last Modified: 10 Jan 2026, ICME 2024, CC BY-SA 4.0

Abstract: Multi-modal neural networks have demonstrated exceptional performance by merging information across modalities, surpassing state-of-the-art uni-modal DNNs. However, this accuracy improvement comes at the cost of increased computation, which leads to higher inference latency. This drawback significantly limits the practical value of multi-modal DNNs, especially for latency-aware applications. We therefore propose M2SN, an adaptive and efficient multi-modal shortcut architecture that reduces execution latency with accuracy guarantees. It skips ineffective network layers to reduce computational cost and alleviate overfitting, adapting to specific models and scenarios. The key contributions of M2SN are twofold: 1) We design and insert shortcuts into each uni-modal network to perform adaptive computing. 2) We design a navigator to dynamically choose the optimal shortcuts. Unlike previous approaches, M2SN is highly general, as it does not rely on any prior knowledge. Experimental results show that M2SN reduces average latency by 28.3% while achieving the same or higher accuracy compared with SOTA baselines.
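The abstract gives no implementation details, so the following is only an illustrative sketch of the general idea of layer-skipping shortcuts chosen by a selector. It is not the authors' design. The names `run_with_shortcuts`, `choose_skips`, and `multimodal_forward`, the per-layer `costs` and `gains` scores, the latency `budget`, the greedy selection rule, and the fusion by concatenation are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_with_shortcuts(layers: Sequence[Layer], x: Vector,
                       skip: Sequence[bool]) -> Vector:
    """Apply layers in order; a shortcut bypasses layer i when skip[i] is True.

    Assumes every layer preserves the vector length, so that skipping a
    layer (an identity shortcut) leaves a valid input for the next one.
    """
    for layer, skipped in zip(layers, skip):
        if not skipped:
            x = layer(x)
    return x


def choose_skips(costs: Sequence[float], gains: Sequence[float],
                 budget: float) -> List[bool]:
    """Hypothetical stand-in for a navigator (not the paper's method).

    Greedily skips the layers with the lowest gain per unit cost until
    the total cost of the remaining layers fits the budget.
    Assumes all costs are positive.
    """
    skip = [False] * len(costs)
    total = sum(costs)
    order = sorted(range(len(costs)), key=lambda i: gains[i] / costs[i])
    for i in order:
        if total <= budget:
            break
        skip[i] = True
        total -= costs[i]
    return skip


def multimodal_forward(branches: Sequence[Sequence[Layer]],
                       inputs: Sequence[Vector],
                       skips: Sequence[Sequence[bool]]) -> Vector:
    """Run each uni-modal branch with its own shortcuts, then fuse.

    Fusion here is plain concatenation, chosen only for simplicity.
    """
    fused: Vector = []
    for layers, x, skip in zip(branches, inputs, skips):
        fused.extend(run_with_shortcuts(layers, x, skip))
    return fused
```

In this sketch, each modality gets its own skip mask, so a cheap modality can keep all of its layers while an expensive one is shortened. A learned selector would replace `choose_skips`; the greedy rule is only a placeholder that makes the control flow concrete.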

External IDs: dblp:conf/icmcs/PuWHX0WGL24