Language-conditioned world model improves policy generalization by reading environmental descriptions
Keywords: language-conditioned world model, model-based, world model
TL;DR: We show that a language-conditioned world model can improve policy generalization in tasks that require understanding environmental descriptions
Abstract: To interact effectively with humans in the real world, it is important for agents to understand language that describes the dynamics of the environment---that is, how the environment behaves---rather than just task instructions specifying what to do.
For example, a cargo-handling robot might receive a statement like "the floor is slippery so pushing any object on the floor will make it slide faster than usual".
Understanding such dynamics-descriptive language is important for effective human-agent interaction and for shaping agent behavior.
Recent work addresses this problem with a model-based approach: language is incorporated into a world model, which is then used to learn a behavior policy.
However, existing methods either do not demonstrate policy generalization to unseen language or rely on limiting assumptions, for instance that the latency induced by inference-time planning is tolerable for the target task, or that expert demonstrations are available.
Building on this line of research, we focus on improving policy generalization via a language-conditioned world model while dropping these assumptions.
We propose a model-based reinforcement learning approach, where a language-conditioned world model is trained through interaction with the environment, and a policy is learned from this model---without planning or expert demonstrations.
Concretely, we introduce the Language-aware Encoder for Dreamer World Model (LED-WM), built on top of DreamerV3.
LED-WM features an observation encoder that uses an attention mechanism to explicitly ground the language description to entities in the observation (see the illustrative sketch after the abstract).
We show that policies trained with LED-WM generalize more effectively than baseline methods to unseen games with novel dynamics and language descriptions, across several settings in two environments: MESSENGER and MESSENGER-WM.
To highlight how the policy can leverage the trained world model before real-world deployment, we demonstrate that the policy can be further improved by fine-tuning on synthetic test trajectories generated by the world model (a second illustrative sketch of such imagination-based fine-tuning follows the abstract).
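The grounding mechanism described in the abstract can be pictured as cross-attention in which entity features query the description tokens. The following is a minimal sketch of that idea; all module names, shapes, and design details are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a language-grounded observation encoder in the
# spirit of LED-WM: entity features attend over description tokens via
# cross-attention. All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageGroundedEncoder(nn.Module):
    def __init__(self, entity_dim: int, text_dim: int,
                 hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.entity_proj = nn.Linear(entity_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Entities act as queries; description tokens are keys/values,
        # so each entity representation is grounded in the language.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        self.out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, entities: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        # entities: (batch, num_entities, entity_dim)
        # text_tokens: (batch, num_tokens, text_dim), e.g. LM embeddings
        q = self.entity_proj(entities)
        kv = self.text_proj(text_tokens)
        grounded, _ = self.cross_attn(q, kv, kv)
        # Residual connection keeps raw entity information alongside
        # the language-grounded features.
        return self.out(grounded + q)

# Example usage with toy shapes.
enc = LanguageGroundedEncoder(entity_dim=32, text_dim=64)
entities = torch.randn(2, 5, 32)   # 5 entities per observation
tokens = torch.randn(2, 12, 64)    # 12 description tokens
features = enc(entities, tokens)   # (2, 5, 128)
```

Making the entities the attention queries means each entity representation is rebuilt from the parts of the description relevant to it, which is one plausible way to realize the grounding the abstract describes.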
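The abstract also mentions fine-tuning the policy on synthetic trajectories generated by the world model. Below is a minimal sketch of that general pattern, in the spirit of Dreamer-style actor learning: the policy is rolled forward in latent space and trained to maximize predicted return. The stand-in dynamics, reward head, and training loop are assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch of fine-tuning a policy on trajectories "imagined"
# inside a trained world model. All components here are illustrative
# stand-ins, not the paper's actual interfaces.
import torch
import torch.nn as nn

LATENT, ACTIONS, HORIZON = 64, 4, 15

world_model_step = nn.GRUCell(ACTIONS, LATENT)  # stand-in latent dynamics
reward_head = nn.Linear(LATENT, 1)              # stand-in reward predictor
policy = nn.Sequential(nn.Linear(LATENT, 64), nn.ELU(),
                       nn.Linear(64, ACTIONS))
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)

def imagine_and_finetune(start_latents: torch.Tensor) -> float:
    """Roll the policy forward in latent space; maximize predicted return."""
    z, total_reward = start_latents, 0.0
    for _ in range(HORIZON):
        logits = policy(z)
        # Straight-through Gumbel-softmax keeps the discrete action
        # differentiable, so reward gradients flow back into the policy.
        action = torch.nn.functional.gumbel_softmax(logits, hard=True)
        z = world_model_step(action, z)  # world model predicts next latent
        total_reward = total_reward + reward_head(z).mean()
    loss = -total_reward                 # maximize imagined return
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

loss = imagine_and_finetune(torch.randn(8, LATENT))  # 8 imagined starts
```

Because every rollout happens inside the learned model, this kind of fine-tuning needs no additional environment interaction, which is what makes it usable before real-world deployment.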
Submission Type: Research Paper (4-9 Pages)
Submission Number: 14