Keywords: Compositional Learning, Reinforcement Learning, Multimodal Learning
Abstract: Children can rapidly generalize compositionally constructed rules to unseen test sets. In contrast, deep reinforcement learning (RL) agents must be trained over millions of episodes, and their ability to generalize to unseen combinations remains unclear. Here, we investigate the compositional abilities of RL agents using the task of navigating to instructed color-shape targets in synthetic 3D environments, which allows fine control over the train-test split and the balance of the data. First, we show that when RL agents are naively trained to navigate to target color-shape combinations, they implicitly learn to decompose the instruction, allowing them to (re-)compose and succeed at held-out test instructions ("compositional learning"). Second, when agents were pretrained to learn invariant shape and color concepts ("concept learning"), the number of episodes subsequently needed for compositional learning decreased by 20$\times$. Furthermore, only agents trained on both concept and compositional learning could solve a more complex, out-of-distribution environment zero-shot. Finally, we demonstrate that only text encoders pretrained on image-text datasets (e.g., CLIP) reduced the number of training episodes needed for our agents to demonstrate compositional learning, and that these agents also generalized zero-shot to five new colors unseen during training. Overall, our results are the first to demonstrate that RL agents can leverage synthetic data to implicitly learn concepts and compositionality, and to solve more complex 3D environments zero-shot without additional training episodes.
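To make the experimental design concrete, below is a minimal sketch of how a compositional train/held-out split over color-shape instructions could be constructed, so that every individual color and shape still appears during training while certain combinations are reserved for testing. The specific colors, shapes, held-out count, and the `compositional_split` helper are illustrative assumptions, not the paper's actual setup.

```python
import itertools
import random

# Illustrative color and shape vocabularies (assumed, not the paper's).
COLORS = ["red", "green", "blue", "yellow"]
SHAPES = ["cube", "sphere", "cone", "capsule"]

def compositional_split(colors, shapes, n_held_out, seed=0):
    """Hold out some color-shape combinations entirely, while ensuring
    every individual color and shape remains covered in training."""
    rng = random.Random(seed)
    combos = list(itertools.product(colors, shapes))
    rng.shuffle(combos)
    held_out, train = [], []
    for combo in combos:
        color, shape = combo
        # Combinations not yet held out, excluding the current candidate.
        remaining = [c for c in combos if c != combo and c not in held_out]
        # Only hold out this combination if its color and its shape are
        # each still covered by at least one remaining combination.
        if (len(held_out) < n_held_out
                and any(c[0] == color for c in remaining)
                and any(c[1] == shape for c in remaining)):
            held_out.append(combo)
        else:
            train.append(combo)
    return train, held_out

train, held_out = compositional_split(COLORS, SHAPES, n_held_out=4)
print("train instructions:", train)
print("held-out (compositional test) instructions:", held_out)
```

Under this scheme an agent never sees, say, "red sphere" during training, yet has seen "red" and "sphere" in other combinations, so success on the held-out set indicates it decomposed and recomposed the instruction rather than memorizing pairs.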
Submission Number: 11