Abstract: The rapid expansion of multimodal models has surfaced formidable bottlenecks in computation, memory, and deployment, catalyzing the rise of Efficient Multimodal Learning (EML) as a pivotal research frontier. Despite intensive progress, a cohesive understanding of \textit{what}, \textit{how}, and \textit{where} efficiency manifests across the learning stack remains elusive. This survey systematizes the EML landscape by introducing the first structured, model-to-system taxonomy. We distill insights from over 300 seminal works into three hierarchical levels—\textit{model}, \textit{algorithm}, and \textit{system}—addressing architectural parsimony, execution refinement, and hardware-aware orchestration, respectively.
Moving beyond a purely categorical review, we offer a methodological synthesis of the vertical synergies between these layers, elucidating how cross-layer co-design resolves the fundamental ``Efficiency-Utility-Privacy'' trilemma. Through an integrative case study of Multimodal Large Language Models (MLLMs), we trace the field's evolutionary trajectory from initial structural adjustments to modern full-stack resource orchestration. Furthermore, we provide a holistic discussion and application-specific optimization blueprints for diverse domains, and we posit a paradigm shift toward self-regulating intelligence, where efficiency is an intrinsic, emergent property of a model's fundamental design rather than a post-hoc constraint. Finally, we present open challenges and future directions that will define the trajectory of EML research. This survey establishes a formal foundation for multimodal systems that are not only high-performing and generalizable but also natively efficient and ready for ubiquitous deployment. We also maintain a GitHub repository to continuously collect related work for the research community.
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Long_Chen8
Submission Number: 7191