TL;DR: A framework that can play RDR2, Stardew Valley, Cities: Skylines, and Dealer's Life 2, and operate various software applications, through a unified interface
Abstract: Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules (Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory), Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley, and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu, and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents, thus paving the way for generalist agents.
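As a rough illustration of the GCC setting described in the abstract (screenshot in, keyboard and mouse actions out), the following minimal Python sketch walks one agent step through the six named modules. It is not taken from the Cradle codebase: the class names (`Action`, `Memory`, `GCCAgent`), the `describe_screen` callable standing in for the LMM, and the toy skill-matching rule are all hypothetical and chosen only to make the control flow concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """A low-level input event (hypothetical representation)."""
    device: str   # "keyboard" or "mouse"
    command: str  # e.g. "press:esc" or "click:120,340"


@dataclass
class Memory:
    """Keeps a record of past steps for later reflection."""
    history: List[dict] = field(default_factory=list)

    def store(self, record: dict) -> None:
        self.history.append(record)

    def recent(self, n: int = 1) -> List[dict]:
        return self.history[-n:]


class GCCAgent:
    """Toy agent: screenshots as input, keyboard/mouse actions as output."""

    def __init__(self,
                 describe_screen: Callable[[bytes], str],
                 skills: Dict[str, Callable[[], List[Action]]]):
        # describe_screen is an assumed stand-in for an LMM call.
        self.describe_screen = describe_screen
        self.skills = dict(skills)
        self.memory = Memory()

    def step(self, screenshot: bytes, task: str) -> List[Action]:
        # Information Gathering: turn pixels into a text observation.
        observation = self.describe_screen(screenshot)

        # Self-Reflection: assume the last step failed if the screen
        # description did not change (a deliberately crude heuristic).
        last = self.memory.recent(1)
        progressed = not last or last[-1]["observation"] != observation

        # Task Inference: keep the task, or mark it as a retry.
        subtask = task if progressed else f"retry: {task}"

        # Skill Curation: pick skills whose name appears in the observation
        # (toy rule; a real system would let the model choose).
        candidates = [name for name in self.skills if name in observation]

        # Action Planning: run the first candidate skill, or do nothing.
        chosen = candidates[0] if candidates else "noop"
        actions = self.skills.get(chosen, lambda: [])()

        # Memory: record what was seen and done.
        self.memory.store({"observation": observation,
                           "subtask": subtask,
                           "skill": chosen})
        return actions


# Usage with a stubbed screen describer and a single skill.
agent = GCCAgent(
    describe_screen=lambda shot: "open_menu button visible",
    skills={"open_menu": lambda: [Action("keyboard", "press:esc")]},
)
first_actions = agent.step(b"", "open the menu")
```

In this sketch the only channel between agent and software is the screenshot bytes going in and the list of `Action` events coming out, which is the constraint the GCC setting imposes; everything inside `step` is free to change per implementation.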
Lay Summary: How can we enable AI agents to perform all kinds of computer tasks—not just browsing the web, but also playing video games and operating complex software? The Cradle framework offers an answer by allowing chatbot models like GPT-4o to use computers the same way humans do: by viewing the screen and controlling the keyboard and mouse.
Rather than relying on built-in shortcuts or special software access, Cradle harnesses the power of multimodal large language models to interpret screenshots and generate code that simulates human interactions. Comprising six key modules, Cradle enables models to observe ongoing activities, reflect on past actions, plan subsequent steps, and store useful skills for future tasks, thereby effectively managing challenging and intricate assignments.
Cradle has successfully completed long, complex missions in demanding games like Red Dead Redemption 2, Cities: Skylines, and Stardew Valley, and has carried out various software tasks such as image and video editing. This marks a significant step toward building general-purpose AI agents that are adaptable, capable, and human-like in their digital interactions.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/BAAI-Agents/Cradle
Primary Area: Applications
Keywords: Foundation Agents, Large Multimodal Models, Decision-making, General Computer Control
Submission Number: 10484