Keywords: Reinforcement Learning, Conservative Q-Learning, Offline Reinforcement Learning, Vision Transformers, Transformers
TL;DR: We enhance the Vision Transformer for image-based offline RL by introducing spatio-temporal attention layers, and we investigate how different embedding-sequence aggregation methods affect ViT performance.
Abstract: It has been shown that offline RL methods such as conservative Q-learning (CQL) scale favorably for training generalist agents with a ResNet backbone. Recent vision and natural language processing research shows that transformer-based models scale more favorably than domain-specific models with strong inductive biases (such as convolutional and recurrent neural networks). In this paper, we investigate how well Vision Transformers (ViTs) serve as backbones for CQL when training single-game agents. We enhance the ViT for image-based RL by introducing spatio-temporal attention layers, and we further investigate the impact of various embedding-sequence aggregation methods on ViT performance. Overall, our modified ViT outperforms the standard ViT in the single-game Atari setting.
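To make the two modifications concrete, below is a minimal sketch, assuming a factorized spatial-then-temporal attention design over stacked frame patch embeddings and two simple aggregation options (mean pooling vs. taking a first/CLS-style token). All module names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): factorized spatio-temporal attention
# over a stack of frame patch embeddings, plus two embedding-sequence
# aggregation options. Shapes and names are assumptions for illustration.

import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    """Attend over patches within each frame, then over time for each patch."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape

        # Spatial attention: each frame attends over its own patches.
        xs = self.norm1(x).reshape(b * t, p, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, p, d)

        # Temporal attention: each patch position attends across frames.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        return x


def aggregate(tokens: torch.Tensor, method: str = "mean") -> torch.Tensor:
    # tokens: (batch, seq_len, dim) flattened ViT output sequence.
    if method == "mean":
        return tokens.mean(dim=1)   # average pooling over all tokens
    if method == "first":
        return tokens[:, 0]         # e.g. a prepended CLS-style token
    raise ValueError(f"unknown aggregation: {method}")


if __name__ == "__main__":
    x = torch.randn(2, 4, 49, 128)        # 2 obs stacks, 4 frames, 7x7 patches
    block = SpatioTemporalBlock(dim=128)
    out = block(x)                        # (2, 4, 49, 128)
    state = aggregate(out.flatten(1, 2))  # (2, 128), e.g. input to a CQL Q-head
    print(out.shape, state.shape)
```

One reason such a factorization is commonly used: each block attends over patches within a frame and over frames per patch position, which is cheaper than full attention over the flattened frames-times-patches sequence.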
Submission Number: 87