Hybrid Self-evolving Structured Memory for GUI Agents

ACL ARR 2026 January Submission7893 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Multimodal Large Language Models, GUI Agents, Memory-Augmented Agents, Graph-based Retrieval, Retrieval-Augmented Generation, Multimodal Reasoning, Knowledge Graphs
Abstract: The remarkable progress of vision–language models (VLMs) has enabled GUI agents to interact with computers in a human-like manner. Yet real-world computer-use tasks remain difficult due to long-horizon workflows, diverse interfaces, and frequent intermediate errors. Prior work equips agents with external memory built from large collections of trajectories, but relies on flat retrieval over discrete summaries or continuous embeddings, falling short of the structured organization and self-evolving characteristics of human memory. Inspired by the brain, we propose Hybrid Self-evolving Structured Memory (HyMEM), a graph-based memory that couples discrete high-level symbolic nodes with continuous trajectory embeddings. HyMEM maintains a graph structure to support multi-hop retrieval, self-evolution via node update operations, and on-the-fly working-memory refreshing during inference. Extensive experiments show that HyMEM consistently improves open-source GUI agents, enabling 7B/8B backbones to match or surpass strong closed-source models; notably, it boosts Qwen2.5-VL-7B by +22.5% and outperforms Gemini2.5-Pro-Vision and GPT-4o. Our code and data will be publicly released.
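The abstract describes a memory that couples discrete symbolic nodes with continuous trajectory embeddings on a graph, retrieved via multi-hop traversal and revised through node-update operations. Since the paper's code is not yet released, the following is only an illustrative sketch of that general idea; all class and method names (`HybridMemory`, `retrieve`, `update_node`) and the toy similarity seeding are assumptions, not the authors' implementation:

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class HybridMemory:
    """Toy hybrid memory: symbolic summary nodes + continuous embeddings on a graph."""

    def __init__(self):
        self.nodes = {}                # node_id -> (summary, embedding)
        self.edges = defaultdict(set)  # node_id -> set of neighbor ids

    def add_node(self, node_id, summary, embedding):
        self.nodes[node_id] = (summary, embedding)

    def add_edge(self, a, b):
        # Undirected symbolic link between two memory nodes.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def update_node(self, node_id, summary=None, embedding=None):
        # Crude stand-in for "self-evolution": revise a node in place
        # as new trajectories arrive, keeping old fields if not replaced.
        old_summary, old_emb = self.nodes[node_id]
        self.nodes[node_id] = (summary or old_summary, embedding or old_emb)

    def retrieve(self, query_emb, top_k=1, hops=1):
        # Seed with the most similar nodes (continuous match) ...
        ranked = sorted(self.nodes,
                        key=lambda n: cosine(self.nodes[n][1], query_emb),
                        reverse=True)
        frontier = set(ranked[:top_k])
        result = set(frontier)
        # ... then expand along symbolic edges (multi-hop retrieval).
        for _ in range(hops):
            frontier = {nb for n in frontier for nb in self.edges[n]} - result
            result |= frontier
        return {n: self.nodes[n][0] for n in result}
```

For example, a query embedding close to a "launch the settings app" node would also surface a linked "flip the wifi switch" node one hop away, while an unconnected, dissimilar node stays out of the retrieved set.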
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Autonomous agents, LLM agents, tool use, multi-modal agents, planning in agents, environment interaction, agent memory, agent evaluation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7893