CoRe agents running in various environments

All samples are from five validation runs after 1M steps of training (unless noted otherwise).

Distracting Control Suite

Walker Easy In this setting, the agent can learn to control the walker within 1M steps.
Walker Medium
The agent performs reasonably well, but occasional failures still occur.
Walker Hard Failures occur frequently at 1M steps of training.
Walker Hard at 2M steps Failures are almost gone at 2M steps of training.
Cartpole Easy The agent learns to balance the pole, with occasional failures.
Cartpole Hard The agent cannot solve the task at 1M steps.
Cartpole Hard at 2M steps Successes occur, but failures are still common at 2M steps of training.

Robosuite Door Opening

Panda static
Panda dynamic
Jaco static

Visualization using gating masks

Cheetah
Reacher
Finger
Door Opening
Such masks are not obtained when the same gating-enabled encoders are used with the baseline SAC+RAD model.
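For readers who want to reproduce this kind of visualization, the following is a minimal sketch of one common way to render a gating mask over an observation: apply a sigmoid to the encoder's gate activations, upsample the resulting soft mask to image resolution, and use it to dim suppressed pixels. The function names, gate-map resolution, and upsampling scheme here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def upsample_mask(gate_logits, image_hw):
    """Turn low-resolution gating logits into a full-resolution soft mask.

    gate_logits : (h, w) pre-sigmoid gate values from a hypothetical
                  gated encoder layer.
    image_hw    : (H, W) target resolution; assumed to be integer
                  multiples of (h, w).
    """
    mask = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid -> values in (0, 1)
    sh = image_hw[0] // gate_logits.shape[0]
    sw = image_hw[1] // gate_logits.shape[1]
    return np.kron(mask, np.ones((sh, sw)))     # nearest-neighbour upsampling

def overlay(image, mask):
    """Dim pixels the gates suppress, keeping attended regions bright."""
    return image * mask[..., None]

# Toy example: an 84x84 RGB frame and a 21x21 gate map that is
# open only in the centre of the image.
image = np.ones((84, 84, 3))
logits = np.full((21, 21), -5.0)
logits[8:13, 8:13] = 5.0
vis = overlay(image, upsample_mask(logits, (84, 84)))
print(vis.shape)
```

The overlay makes the effect directly visible: background pixels whose gates are closed go dark, while gated-open regions keep their original intensity.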

Long-term predictions

Since our model avoids reconstruction during training, it has no pixel decoder with which to visualize long-term predictions. However, we can obtain such a decoder purely for probing the latent space by using parallel distraction-free versions of the observations. We train a pixel decoder to predict the distraction-free observations from the model's contrastive prediction, which is detached by applying a stop-gradient operation so that training the probe does not affect the model.
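The essential point of the probe is that gradients flow only into the decoder, never back into the model, because the latents are detached. A minimal sketch of that idea, using a linear probe fit by least squares on synthetic data (all shapes and variable names here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

# The latents stand in for the model's contrastive predictions. Because
# the probe is fit on detached (stop-gradient) features, the model is
# treated as a fixed feature extractor and receives no training signal
# from the decoder.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 16))   # detached latent predictions
W_true = rng.normal(size=(16, 64))
clean_obs = latents @ W_true           # stand-in for distraction-free frames

# Fit only the probe's weights; the latents themselves never change.
W_probe, *_ = np.linalg.lstsq(latents, clean_obs, rcond=None)

recon_error = np.abs(latents @ W_probe - clean_obs).max()
print(recon_error < 1e-8)
```

In a deep-learning framework the same effect is achieved by detaching the latents (e.g. a stop-gradient) before feeding them to a convolutional decoder trained with a pixel loss against the distraction-free frames.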
Top: the agent's observations. Middle: the predicted next-step observation decoded from the model's latent state using the trained probe. Bottom: the decoded observations obtained by rolling out the model's latent state in imagination. The imagination rollout starts after 50 time steps (when the green border switches to blue).
Cheetah Reacher Walker
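The rollout procedure described above can be sketched as a simple control-flow pattern: condition the latent state on real observations for the first 50 steps, then switch to feeding the model's own predictions back into its transition function. The `encode` and `dynamics` functions below are hypothetical stand-ins for the model's encoder and latent transition model; the toy linear dynamics exist only to exercise the control flow.

```python
import numpy as np

def rollout(encode, dynamics, observations, context_steps=50):
    """Closed-loop latent inference for `context_steps`, then open-loop imagination.

    observations : (T, obs_dim) sequence the agent saw.
    Returns the (T, latent_dim) latent trajectory; steps beyond
    `context_steps` use only the model's own predictions.
    """
    z = encode(observations[0])
    traj = [z]
    for t in range(1, len(observations)):
        if t < context_steps:
            z = encode(observations[t])   # condition on the real observation
        else:
            z = dynamics(z)               # imagine: feed predictions back in
        traj.append(z)
    return np.stack(traj)

# Toy linear latent model, purely for illustration.
A = np.eye(4) * 0.9
obs = np.random.default_rng(1).normal(size=(60, 4))
traj = rollout(lambda o: o.copy(), lambda z: A @ z, obs)
print(traj.shape)
```

Each imagined latent would then be passed through the trained probe decoder to produce the bottom-row frames shown in the videos.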