- Keywords: reinforcement learning, self-supervised learning, data efficiency, generalization
- Abstract: Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and lighting. The problem becomes even harder when those distractions are not seen during training. Although several recent studies have shown that representation robustness can be improved, they still suffer from relatively low performance even with simple, static backgrounds. To enhance the quality of the representation, we design an RL framework that combines three self-supervised learning methods: 1) Adversarial Representation, 2) Forward Dynamics, and 3) Inverse Dynamics. On a set of continuous control tasks from the DeepMind Control suite, our joint self-supervised RL (JS2RL) learns control efficiently with both simple and distracting backgrounds, and significantly improves generalization performance on unseen backgrounds. On an autonomous driving task in CARLA, JS2RL also achieves the best performance on complex, realistic observations that contain a large amount of task-irrelevant information.
- One-sentence Summary: Joint self-supervised learning can significantly improve data efficiency and generalization performance in vision-based control tasks.
- Supplementary Material: zip