HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models

Published: 11 Mar 2024 · Last Modified: 26 Apr 2024 · LLMAgents @ ICLR 2024 Poster · License: CC BY 4.0
Keywords: LLM Planners, Generalist Agents, Embodied Agents
TL;DR: An embodied agent with a memory of language-program pairs for executing complex tasks across multiple domains. It achieves few-shot state-of-the-art performance across four benchmarks, all with a single agent and without in-domain training.
Abstract: Methods for developing instructable embodied artificial agents typically train distinct models for each application and language domain to map instructions to the corresponding actions and task plans. Here we explore the feasibility of developing a versatile “generalist” instructable agent capable of operating, with a single model, across a broad spectrum of tasks, language domains, and environments. Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners: a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt, improving the LLM's ability to infer the correct action and task plans. Our approach, HELPER-X, expands such an external language-program memory with a wide range of examples and prompt templates, while also extending the agent's action API. This expanded, shared, unified memory enables the agent to work across the domains of executing plans from dialogue, natural-language instruction following, active question asking, and commonsense room reorganization. We evaluate the agent on four diverse interactive vision-language embodied-agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task. These benchmarks vary significantly in input instructions, question-asking capabilities, task structures, and environmental settings. HELPER-X achieves few-shot, state-of-the-art performance across these benchmarks with a single agent, without in-domain training, and remains competitive with agents that have undergone in-domain training. Our work demonstrates the potential of memory-augmented large language models to support generalist instructable embodied agents.
Submission Number: 130
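
The abstract describes memory-augmented prompting: retrieve language-program exemplars similar to the input instruction and place them in the LLM prompt as in-context examples. The sketch below illustrates one way such a retrieve-and-prompt loop could look. The memory contents, the bag-of-words retriever, and the names MEMORY, retrieve_examples, and build_prompt are illustrative assumptions for this sketch, not HELPER-X's actual API or retrieval method.

# Minimal, runnable sketch of retrieval-augmented planning prompts, assuming
# a shared memory of (instruction, program) pairs and a simple lexical retriever.
from collections import Counter
import math

# Hypothetical shared memory spanning several domains (dialogue plans,
# instruction following, tidying); entries are (instruction, program) strings.
MEMORY = [
    ("put a clean mug on the coffee table",
     "goto(mug); pickup(mug); goto(sink); clean(mug); goto(coffee_table); put(mug)"),
    ("heat the potato and leave it on the counter",
     "goto(potato); pickup(potato); goto(microwave); heat(potato); goto(counter); put(potato)"),
    ("tidy the living room",
     "for obj in out_of_place_objects(): pickup(obj); goto(default_location(obj)); put(obj)"),
]

def _bow(text):
    # Bag-of-words term counts; a real system would likely use learned embeddings.
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(instruction, k=2):
    # Return the k memory entries most similar to the input instruction.
    q = _bow(instruction)
    ranked = sorted(MEMORY, key=lambda ex: _cosine(q, _bow(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(instruction, template="Translate the instruction into a program.\n"):
    # Compose an LLM prompt with the retrieved exemplars as in-context examples.
    shots = "\n".join(f"Instruction: {i}\nProgram: {p}" for i, p in retrieve_examples(instruction))
    return f"{template}{shots}\nInstruction: {instruction}\nProgram:"

if __name__ == "__main__":
    print(build_prompt("warm up the mug and put it on the counter"))

Swapping in domain-specific prompt templates and a larger example memory, as the abstract describes, would change only MEMORY and the template argument; the retrieve-then-prompt structure stays the same.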