From Models to Systems: A Comprehensive Survey of Efficient Multimodal Learning

Published: 15 May 2026, Last Modified: 15 May 2026. Accepted by TMLR. License: CC BY 4.0.
Abstract: The rapid expansion of multimodal models has surfaced formidable bottlenecks in computation, memory, and deployment, catalyzing the rise of Efficient Multimodal Learning (EML) as a pivotal research frontier. Despite intensive progress, the understanding of $\textit{what}$ efficiency is, $\textit{how}$ it is achieved, and $\textit{where}$ it is manifested across the learning stack remains fragmented. This survey systematizes the EML landscape by introducing the first structured, model-to-system taxonomy. We distill insights from over 300 seminal works into three hierarchical levels ($\textit{model}$, $\textit{algorithm}$, and $\textit{system}$), addressing architectural parsimony, execution refinement, and hardware-aware orchestration, respectively. Moving beyond a purely categorical review, we offer a methodological synthesis of the vertical synergies between these layers, elucidating how cross-layer co-design contributes to the fundamental "Efficiency-Utility-Privacy" trade-off. Through an integrative case study of Multimodal Large Language Models (MLLMs), we trace the field’s evolutionary trajectory from initial structural adjustments to modern full-stack resource orchestration. Furthermore, we provide a holistic discussion and application-specific optimization blueprints for diverse domains and posit a paradigm shift toward self-regulating intelligence, where efficiency is an intrinsic, emergent property of the model’s fundamental design rather than a post-hoc constraint. Finally, we present open challenges and future directions that will define the trajectory of EML research. This survey establishes a structured framework for multimodal systems that are not only high-performing and generalizable but also natively efficient and ready for ubiquitous deployment. A continuously updated version is available at https://github.com/pwang322/Efficient-Multimodal-Learning-Survey.
Certifications: Survey Certification
Submission Type: Long submission (more than 12 pages of main content)
Code: https://github.com/pwang322/Efficient-Multimodal-Learning-Survey
Assigned Action Editor: ~Long_Chen8
Submission Number: 7191