Hierarchical Procedural Meta-Reasoning for Generalizable Multimodal Agents

Published: 02 Mar 2026, Last Modified: 31 Mar 2026
Venue: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy (Poster)
License: CC BY 4.0
Keywords: sequential decision making, mobile agents, large multimodal models
Abstract: While multimodal agents can achieve strong performance through fine-tuning, their ability to generalize remains limited in complex real-world tasks such as mobile navigation, where diverse applications, frequent system updates, and customized workflows are common. We argue that a fundamental bottleneck is whether an agent possesses sufficient task-specific procedural knowledge to accomplish a given goal. In practice, because the agent's knowledge is limited or outdated, the procedural steps it generates may be hallucinated and misaligned with the environment during execution. Better procedural knowledge, however, can be supplied by the general capabilities of large language models or, when necessary, obtained from external resources such as web search. Based on this view, we propose the Procedure-Aware Multimodal Agent with Meta Reasoning, a framework that explicitly represents task knowledge as natural-language procedures and trains a procedure-aware grounded agent to condition its actions on this knowledge. By learning to leverage procedural knowledge from different sources, our approach enables robust and reliable generalization with reduced procedural hallucination across tasks, applications, interface versions, and multi-app workflows, achieving substantial improvements on challenging Android benchmarks.
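The abstract describes sourcing natural-language procedures either from a model's own knowledge or from external retrieval, then conditioning actions on them. The following is a minimal, purely illustrative sketch of that control flow; all names (`Procedure`, `from_model`, `from_search`, `act`) are hypothetical stand-ins, not the paper's actual implementation or API.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Procedure:
    goal: str
    steps: List[str]  # ordered natural-language steps

def from_model(goal: str) -> Optional[Procedure]:
    """Stand-in for querying a large (multimodal) model for procedural knowledge."""
    # Toy lookup table standing in for the model's parametric knowledge.
    known = {
        "enable dark mode": ["Open Settings", "Tap Display", "Toggle Dark theme"],
    }
    steps = known.get(goal)
    return Procedure(goal, steps) if steps else None

def from_search(goal: str) -> Procedure:
    """Stand-in for retrieving procedural knowledge from an external source
    such as web search, used when the model's own knowledge is insufficient."""
    return Procedure(goal, [f"(retrieved) first step for: {goal}"])

def get_procedure(goal: str) -> Procedure:
    # Prefer the model's own knowledge; fall back to external retrieval.
    return from_model(goal) or from_search(goal)

def act(goal: str, observation: str) -> str:
    # Condition the next action on the retrieved procedure plus the current
    # screen observation, rather than generating steps unconditionally --
    # the mechanism the abstract credits with reducing procedural hallucination.
    proc = get_procedure(goal)
    return f"{proc.steps[0]} | screen={observation}"
```

For example, `act("enable dark mode", "home screen")` grounds the first known step in the current observation, while an unseen goal falls through to the retrieval stand-in.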
Submission Number: 82