TARAD: Task-Aware Robot Affordance-Centric Diffusion Policy Learned From LLM-Generated Demonstrations

Site Hu, Takayuki Nagai, Takato Horii

Published: 2025 · Last Modified: 01 May 2026 · IEEE Robotics and Automation Letters, 2025 · CC BY-SA 4.0
Abstract: In open-ended task settings, the ability of a robot to execute diverse tasks accurately by following language instructions is critical. Methods based on traditional imitation learning typically depend on extensive expert demonstrations and often struggle to generalize to unseen scenarios or tasks. Recently, approaches leveraging large foundation models have demonstrated improved generalization by enhancing task comprehension in novel scenarios through the intrinsic world knowledge embedded in these models. However, these methods rely on predefined motion primitives and lack a detailed understanding of the environment, which is essential for successful execution. Herein we introduce the Task-Aware Robot Affordance-Centric Diffusion Policy (TARAD), a novel framework for robot manipulation. TARAD leverages large language models and vision-language models to perform high-level planning from natural language instructions and to extract affordance information from the robot's observations. A heuristic motion planner is employed for low-level motion planning, enabling zero-shot trajectory synthesis and the fully automatic generation of a dataset with language labels and affordances. By incorporating affordances into the observation space, our approach integrates the intrinsic commonsense and reasoning capabilities of foundation models into imitation learning, enabling the training of an affordance-centric, multi-task three-dimensional (3D) diffusion policy. Empirical evaluations in both RLBench simulated environments and real-world experiments with a UR5e robot demonstrate that TARAD effectively combines the precise control of imitation learning with the strong generalization capabilities of foundation models, all without relying on expert demonstrations or predefined motion primitives.
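The abstract's central mechanism is "incorporating affordances into the observation space" of the policy. The sketch below illustrates one plausible reading of that idea: augmenting each point of a 3D observation with an affordance score before it is fed to the policy network. All names (`Observation`, `attach_affordances`) and the flat list representation are hypothetical simplifications for illustration, not TARAD's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    # Per-point 3D coordinates — a hypothetical, simplified stand-in for the
    # robot's point-cloud observations described in the abstract.
    points: List[List[float]]

def attach_affordances(obs: Observation,
                       affordance_scores: List[float]) -> List[List[float]]:
    """Append a per-point affordance score (e.g., produced by a VLM) to each
    3D point, so the policy's observation carries the affordance signal.
    Illustrative only: the paper's actual representation is not shown here."""
    assert len(obs.points) == len(affordance_scores)
    return [p + [a] for p, a in zip(obs.points, affordance_scores)]

obs = Observation(points=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
augmented = attach_affordances(obs, [0.9, 0.1])
# Each point becomes (x, y, z, affordance), e.g. [0.1, 0.2, 0.3, 0.9]
```

In this reading, the diffusion policy conditions on (x, y, z, affordance) tuples rather than raw geometry alone, which is how the foundation models' reasoning could be injected into imitation learning without predefined motion primitives.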