All samples are from five validation runs after 1M steps of training (unless noted otherwise).
Distracting Control Suite
Walker Easy
In this setting, the agent can learn to control the walker within 1M steps.
Walker Medium
The agent performs reasonably well, but occasional failures occur.
Walker Hard
Failures happen frequently at 1M steps of training.
Walker Hard at 2M steps
Failures are almost entirely gone after 2M steps of training.
Cartpole Easy
Learns to balance the pole with occasional failures.
Cartpole Hard
Cannot solve the task at 1M steps.
Cartpole Hard at 2M steps
Successes occur, but failures are still common at 2M steps.
Robosuite Door Opening
Panda static
Panda dynamic
Jaco static
Visualization using gating masks
Cheetah
Reacher
Finger
Door Opening
Such masks are not obtained when we use the same gating-enabled encoders with the baseline SAC+RAD model.
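As a rough illustration of how such gating masks can be extracted for visualization, the sketch below shows a convolutional encoder with a learned spatial sigmoid gate whose mask is upsampled to image resolution and multiplied into the observation. This is a minimal, hypothetical layout for illustration only; the actual encoder architecture and gating mechanism in our model may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvEncoder(nn.Module):
    """Conv encoder with a 1-channel spatial sigmoid gate (hypothetical layout)."""
    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Gate values in (0, 1), applied multiplicatively to the feature map.
        self.gate = nn.Conv2d(feat_ch, 1, 1)

    def forward(self, obs):
        h = self.features(obs)
        mask = torch.sigmoid(self.gate(h))  # (B, 1, H', W')
        return h * mask, mask

def gating_mask_overlay(obs, mask):
    """Upsample the gate to image resolution and mask the observation."""
    up = F.interpolate(mask, size=obs.shape[-2:], mode="bilinear",
                       align_corners=False)
    return obs * up

obs = torch.rand(1, 3, 64, 64)   # placeholder observation batch
enc = GatedConvEncoder()
_, mask = enc(obs)
vis = gating_mask_overlay(obs, mask)
```

Visualizations like the ones above are produced by saving `vis` (or `up` alone) as an image per frame.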
Long-term predictions
Since our model avoids reconstruction during training, it has no pixel decoder with which to visualize long-term predictions. However, we can train such a decoder purely as a probe of the latent space, using parallel distraction-free versions of the observations. The probe decoder is trained to predict the distraction-free observation from the model's contrastive prediction, which is detached by a stop-gradient operation so that probe training does not affect the model. Top: the agent's observations. Middle: the decoded next-step predictions from the model's latent state, using the trained probe. Bottom: the decoded observations obtained by rolling out the model's latent state in imagination. The imagination rollout starts after 50 time steps (when the green border switches to blue).
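The probe training described above can be sketched as follows. The latent size, decoder architecture, and variable names (`PixelProbe`, `z_pred`, `clean_obs`) are illustrative assumptions, not our actual implementation; the key points are that the target is the distraction-free frame and that `detach()` implements the stop-gradient, so no probe gradient reaches the world model.

```python
import torch
import torch.nn as nn

class PixelProbe(nn.Module):
    """Decoder probe: latent prediction -> distraction-free 64x64 RGB frame."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(  # 4 -> 8 -> 16 -> 32 -> 64
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 4, 4)
        return self.deconv(h)

probe = PixelProbe()
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

# z_pred stands in for the model's contrastive prediction; detach() is the
# stop-gradient, so probe training never updates the world model.
z_pred = torch.randn(8, 256)           # placeholder latent predictions
clean_obs = torch.rand(8, 3, 64, 64)   # parallel distraction-free frames

recon = probe(z_pred.detach())
loss = ((recon - clean_obs) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

For the imagination rollouts in the bottom row, the same probe is simply applied to latent states produced by rolling the model forward without new observations.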