LaGEA: Language Guided Embodied Agents for Robotic Manipulation

ICLR 2026 Conference Submission 21115 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Embodied AI, Vision Language Model, Reinforcement Learning
TL;DR: LaGEA turns structured natural-language self-reflections into temporally grounded reward shaping for RL, improving embodied manipulation benchmark success with faster convergence.
Abstract: Robotic manipulation benefits from foundation models that describe goals, but today's agents still lack a principled way to learn from their own mistakes. We ask whether natural language can serve as feedback: an error-reasoning signal that helps embodied agents diagnose what went wrong and correct course. We introduce **LaGEA**, a framework that turns episodic, schema-constrained reflections from a vision-language model (VLM) into temporally grounded guidance for reinforcement learning. LaGEA summarizes each attempt in concise language, localizes the decisive moments in the trajectory, aligns feedback with visual state in a shared representation, and converts goal progress and feedback agreement into bounded, step-wise shaping rewards whose influence is modulated by an adaptive, failure-aware coefficient. This design yields dense signals early, when exploration needs direction, and gracefully recedes as competence grows. On the Meta-World MT10 embodied manipulation benchmark, LaGEA improves average success over state-of-the-art (SOTA) methods by 9.0% on random goals and 5.3% on fixed goals, while converging faster. These results support our hypothesis: language, when structured and grounded in time, is an effective mechanism for teaching robots to self-reflect on mistakes and make better choices. Code will be released soon.
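The shaping mechanism the abstract describes (bounded step-wise rewards scaled by an adaptive, failure-aware coefficient that recedes with competence) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual formulation: the function names, the exponential form of the coefficient, and the tanh squashing are all assumptions made for clarity.

```python
import math

def shaping_reward(progress_delta, feedback_agreement, failure_rate,
                   bound=1.0, k=5.0):
    """Illustrative bounded, step-wise shaping reward (hypothetical form).

    progress_delta: change in goal progress at this step (assumed in [-1, 1])
    feedback_agreement: agreement between the VLM feedback embedding and the
        visual state embedding (assumed in [-1, 1])
    failure_rate: recent episode failure rate in [0, 1]; higher values
        increase the shaping influence (failure-aware modulation)
    """
    # Adaptive coefficient: near 1 when the agent fails often, decaying
    # toward 0 as competence grows, so the shaping signal gracefully recedes.
    coeff = 1.0 - math.exp(-k * failure_rate)
    # Combine progress and feedback agreement, then squash so the shaping
    # term stays bounded in [-bound, bound].
    raw = progress_delta + feedback_agreement
    return coeff * bound * math.tanh(raw)
```

Under this sketch, a frequently failing agent (high `failure_rate`) receives a strong dense signal, while a competent one receives almost none, matching the "dense early, recedes later" behavior claimed in the abstract.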
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 21115