TL;DR: We propose FOUNDER, a framework that grounds Foundation Model task representations in World Model goal states, enabling open-ended task specification and completion in embodied environments by capturing deep-level task semantics.
Abstract: Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables learning a goal-conditioned policy through imagination, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
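To make the abstract's grounding-and-reward mechanism concrete, the following is a minimal, illustrative sketch (not the authors' released implementation): a foundation-model task embedding is mapped into the world-model state space to form a goal state, and the reward for an imagined state is the negative predicted temporal distance to that goal. All class and function names (GroundingMap, TemporalDistance, imagined_rewards) and architectural details are hypothetical placeholders.

```python
# Hedged sketch of FOUNDER-style goal grounding and temporal-distance reward.
# Assumes the FM task embedding and WM latent states are given as tensors.
import torch
import torch.nn as nn


class GroundingMap(nn.Module):
    """Maps a foundation-model task embedding into the world-model state space."""

    def __init__(self, fm_dim: int, wm_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(fm_dim, 256), nn.ReLU(),
                                 nn.Linear(256, wm_dim))

    def forward(self, fm_embedding: torch.Tensor) -> torch.Tensor:
        # Returns the inferred goal state in the WM latent space.
        return self.net(fm_embedding)


class TemporalDistance(nn.Module):
    """Predicts a non-negative number of steps from a state to a goal state."""

    def __init__(self, wm_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * wm_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1), nn.Softplus())

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)


def imagined_rewards(imagined_states: torch.Tensor,    # [horizon, batch, wm_dim]
                     fm_task_embedding: torch.Tensor,  # [batch, fm_dim]
                     grounding: GroundingMap,
                     distance: TemporalDistance) -> torch.Tensor:
    """Reward = negative predicted temporal distance to the mapped goal state."""
    goal = grounding(fm_task_embedding)                 # [batch, wm_dim]
    goal = goal.unsqueeze(0).expand_as(imagined_states)  # broadcast over horizon
    return -distance(imagined_states, goal)             # [horizon, batch]
```

Under these assumptions, the reward tensor could be plugged into any imagination-based actor-critic update, so no hand-specified reward function is needed for each task.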
Lay Summary: Imagine trying to get a robot to understand and perform all sorts of tasks just like we do, say, by telling it "go get an apple from the kitchen" or showing it a quick video. This is actually very hard! Current AI is often either like a knowledgeable "scholar" (Foundation Models) that understands many concepts but doesn't quite know how to "act" in the physical world, or like a skilled "artisan" (World Models) that can simulate physical interactions and learn specific actions but struggles with broad, open-ended instructions.
Our research is like building a vital bridge between this "scholar" and "artisan," enabling them to collaborate effectively. We introduce FOUNDER, a new method that cleverly translates the "scholar's" understanding of a task (for instance, "getting an apple" means the apple ultimately rests in the robot's hand) into a concrete "target snapshot" or goal state within the "artisan's" simulated practice arena. Once this target is clear, the robot can teach itself how to reach it by imagining and rehearsing within its own "mental mini-world" (the World Model), all without us needing to painstakingly define complex reward rules for every single task.
This technology means future robots and AI agents could more easily interpret and execute a wide variety of open-ended instructions we provide, whether given through natural language or demonstrated in a video, from exploring in a game to performing complex manipulations in the real world. This opens new doors to creating more versatile and intelligent robotic assistants that seamlessly integrate into our lives and work.
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reinforcement Learning, World Models, Foundation Models, Open-ended Tasks, Goal-Conditioned Reinforcement Learning, Multi-task Reinforcement Learning, Learning Reward Functions
Submission Number: 15878