Foundations and Frontiers of Multimodal Agentic Frameworks

TMLR Paper7032 Authors

16 Jan 2026 (modified: 21 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Advances in large language models (LLMs) have fueled a wave of research into agency: the ability to reason, plan, and act. This effort has produced agentic frameworks that orchestrate perception, memory, and decision-making around powerful LLM backbones. With the advent of large multimodal models (LMMs), these systems can process and integrate diverse modalities, including images, audio, and video, thereby improving their real-world applicability. Yet, while surveys of LLM-based agents exist, the role of multimodality in shaping agency has not been systematically examined in recent years. This survey fills this gap by analyzing the impact of multimodality across the core functional modules of the agentic framework: perception, reasoning, planning, memory, and action. Using this lens, we trace the evolution from text-centric agents to multimodal frameworks, examine how modalities are integrated through delegated, late-fusion, and early-fusion architectures, and assess the emergence of agentic behaviors enabled by grounded perception and multimodal reasoning. We organize existing work through a modality-centric taxonomy that links architectural design choices to agent capabilities. Moreover, we review multimodal agentic systems across various application domains, including Robotics, GUI & Web Navigation, Multimedia Content Generation & Editing, and Long-form Video Understanding & Retrieval. Beyond capabilities, we analyze performance across these settings and discuss efficiency-scalability trade-offs, including training and inference costs, latency, and deployment constraints. By focusing on the impact of multimodality in agentic design, we aim to identify key gaps and chart a roadmap toward robust and general-purpose intelligent systems.
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Yaodong_Yang1
Submission Number: 7032