Abstract: The rapid expansion of multimodal models has surfaced formidable bottlenecks in computation, memory, and deployment, catalyzing the rise of Efficient Multimodal Learning (EML) as a pivotal research frontier. Despite intensive progress, a cohesive understanding of \textit{what}, \textit{how}, and \textit{where} efficiency manifests across the learning stack remains elusive. This survey systematizes the EML landscape by introducing the first structured, model-to-system taxonomy. We distill insights from over 300 seminal works into three hierarchical levels—\textit{model}, \textit{algorithm}, and \textit{system}—addressing architectural parsimony, execution refinement, and hardware-aware orchestration, respectively.
Moving beyond a purely categorical review, we offer a methodological synthesis of the vertical synergies between these layers, elucidating how cross-layer co-design resolves the fundamental ``Efficiency-Utility-Privacy'' trilemma. Through an integrative case study of Multimodal Large Language Models (MLLMs), we trace the field's evolutionary trajectory from initial structural adjustments to modern full-stack resource orchestration. Furthermore, we provide a holistic discussion and application-specific optimization blueprints for diverse domains, and we posit a paradigm shift toward self-regulating intelligence, where efficiency is an intrinsic, emergent property of a model's fundamental design rather than a post-hoc constraint. Finally, we present open challenges and future directions that will define the trajectory of EML research. This survey establishes a formal foundation for multimodal systems that are not only high-performing and generalizable but also natively efficient and ready for ubiquitous deployment. We also maintain a GitHub repository to continuously collect related work for the research community.
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Long_Chen8
Submission Number: 7191