Keywords: Web Agent, Digital Agent, World Model
Abstract: Large Language Models (LLMs) have recently advanced to the point of powering autonomous web agents. However, these agents still struggle with long-horizon browsing tasks, often making mistakes such as repeating unnecessary actions. For example, an LLM-based agent might fail to recognize that an item has already been added to a shopping cart and attempt to click the 'add' button again. In contrast, humans easily identify when an item has been added and rarely repeat such actions, because they maintain an awareness of task progression while interacting with the web interface. This difference arises from the presence of a world model in humans (i.e., an internal representation that simulates interactions with the environment) and its absence in current LLM-based agents. Motivated by this gap, we propose World-Model-Augmented (WMA) Web Agents, which integrate world models to enhance the decision-making capabilities of LLM-based agents. We introduce a novel mechanism that allows agents to focus on state-transition information when choosing actions. Evaluations on the WebArena benchmark show that the WMA Web Agent outperforms existing baselines, such as the Tree Search Agent, by improving action-selection accuracy and reducing errors in web navigation tasks. This work presents the first successful integration of world models into LLM-based web agents and offers guidance for effective automation in dynamic web environments.
Submission Number: 81