APT: Architectural Planning and Text-to-Blueprint Construction Using Large Language Models for Open-World Agents
Keywords: Large Language Model Planning, Open-world Agent, Multimodal Language Model, Minecraft
TL;DR: APT is a multimodal autonomous agent framework that leverages large language models to generate architectural blueprints and construct complex structures using spatial reasoning, enhanced by memory and reflection, in few-shot scenarios.
Abstract: We present APT, an advanced Large Language Model
(LLM)-driven framework that enables autonomous agents
to construct complex and creative structures within the
Minecraft environment. Unlike previous approaches that primarily concentrate on skill-based open-world tasks or rely
on image-based diffusion models for generating voxel-based
structures, our method leverages the intrinsic spatial reasoning capabilities of LLMs. By employing chain-of-thought
decomposition along with multimodal inputs (textual and
visual), the framework generates detailed architectural layouts and blueprints that the agent can execute under zeroshot or few-shot learning scenarios. Our agent incorporates
both memory and reflection modules to facilitate lifelong
learning, adaptive refinement, and error correction throughout the building process. To rigorously evaluate the agent’s
performance in this emerging research area, we introduce
a comprehensive benchmark consisting of diverse construction tasks designed to test creativity, spatial reasoning, adherence to in-game rules, and the effective integration of
multimodal instructions. Experimental results using various
GPT-based LLM backends and agent configurations demonstrate the agent’s capacity to accurately interpret extensive
instructions involving numerous items, their positions, and
orientations. The agent successfully produces complex structures complete with internal functionalities such as Redstonepowered systems. A/B testing indicates that the inclusion of
a memory module leads to a significant increase in performance, emphasizing its role in enabling continuous learning
and the reuse of accumulated experience. Additionally, the
agent’s unexpected emergence of scaffolding behavior highlights the potential of future LLM-driven agents to utilize
subroutine planning and leverage emergence ability of LLMs
to autonomously develop human-like problem-solving techniques.
Submission Number: 6
Loading