Multimodal LLM Agents: Exploring LLM Interactions in Software, Web, and Operating Systems

UIUC Spring 2025 CS598 LLM Agent Workshop Submission 9 Authors

17 Apr 2025 (modified: 20 Apr 2025) · UIUC Spring 2025 CS598 LLM Agent Workshop · CC BY 4.0
Keywords: LLM, Agents
Abstract: The digital world operates through multimodal interactions, yet current large language model (LLM) agents remain constrained by approaches that convert visual, auditory, and system-level data into lossy textual proxies. This conversion introduces noise and limits an agent's ability to leverage holistic context when making decisions in digital environments. Although recent multimodal models, such as Flamingo and GPT-4 Vision, demonstrate impressive capabilities on vision-language tasks, their potential as agents that make decisions and execute tasks from multimodal inputs remains underexplored. In this survey, we examine the design, evaluation, and capabilities of multimodal LLM agents, focusing on interactions within software environments such as web browsers and operating system interfaces. We analyze recent progress in multimodal integration within agentic systems, investigate frameworks for multimodal tool orchestration, and study interactive agents that incorporate human feedback to guide decision-making. Our work highlights the potential of multimodal agents for building autonomous applications that navigate and interact with the digital world in human-like ways.
Submission Number: 9