ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics
TL;DR: ELEMENTAL enables robots to learn user-aligned reward functions by combining user language instructions and demonstrations with interactive self-reflection using vision-language models.
Abstract: Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot adequately represent the problem from text-based descriptions alone. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to better align robot behavior with user intentions. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and optimally match the demonstrated behaviors. ELEMENTAL also introduces an iterative, self-reflective feedback loop to improve feature, reward, and policy learning. Our experimental results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in learning from demonstration (LfD).
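As a rough illustration of the feature-matching idea behind the IRL step (standard formulation in our own notation; the paper's exact objective may differ): with features phi proposed from the language instruction and demonstration, and a linear reward parameterized by weights w, the weights are chosen so that the learned policy's expected feature counts match those observed in the demonstrations.

% Hedged sketch, not the authors' stated objective: linear reward over
% candidate features \phi, with weights w fit by matching expected feature
% counts of the induced policy \pi_w to those of the demonstrations \mathcal{D}.
\[
r_w(s, a) = w^\top \phi(s, a), \qquad
\text{find } w \ \text{s.t.}\
\mathbb{E}_{\pi_w}\!\Big[\textstyle\sum_t \phi(s_t, a_t)\Big]
\approx
\mathbb{E}_{\mathcal{D}}\!\Big[\textstyle\sum_t \phi(s_t, a_t)\Big].
\]

Under this reading, the self-reflection loop would compare policy rollouts against the demonstrations and revise the feature set phi (and hence the reward and policy) when the match is poor.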
Lay Summary: Teaching robots to perform tasks can be hard, especially when people struggle to describe exactly what they want. ELEMENTAL is a new method that helps robots learn by watching humans demonstrate tasks and following their natural language instructions. It uses vision-language models, which understand both images and language, to figure out what matters for success, and then lets users refine the robot's behavior by giving feedback. ELEMENTAL was tested in both simulation and a real-world salad-making task, where people taught a robot using video recordings and natural language instructions, with no coding or manual labeling needed. The robot learned more human-aligned behaviors than existing systems, showing promise for easier and more intuitive robot training in the future.
Primary Area: Applications->Robotics
Keywords: Learning from Demonstration, Vision-Language Models, Inverse Reinforcement Learning
Flagged For Ethics Review: true
Submission Number: 14019