- Keywords: self-supervision, Reinforcement Learning, auxiliary targets, generalization, U-Net
- TL;DR: We train a U-Net model to generate masks over reward-related objects in images. Our approach trains the U-Net model without explicit label information, using only feedback from a critic model trained with a deep reinforcement learning (DRL) technique.
- Abstract: We train a U-Net model to generate masks over reward-related objects in images. Our approach trains the U-Net model without explicit label information, using only feedback from a critic model that has learned to estimate the expected-reward value of an image observation. The masking is learned in a contrastive fashion from image pairs, using an adversarial scheme that exploits the gradient of the critic score with respect to the masking operation. Each pair consists of two images, where the first has a high critic value and the second a low one. Training with such pairs enables the U-Net model to produce masks that decrease the critic value of the first image and increase the critic value of the second image when the pixels in the masked segment are transferred from the first image to the second. The U-Net model is trained on an imitation database from the NeurIPS 2020 MineRL Competition Track, where our agent was the ?-place winning entry. Video demonstration: www.rebrand.ly/Rewarding-Objects-mp4
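
The pixel-transfer objective described in the abstract can be illustrated with a minimal NumPy sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the helper names `transfer_masked_pixels` and `contrastive_mask_objective`, the choice to zero out the masked segment in the first image, and the toy `brightness_critic` that stands in for the learned critic. In the actual method the mask would come from the U-Net and be trained through the critic's gradient; this sketch only evaluates the quantity that a good mask should make small.

```python
import numpy as np

def transfer_masked_pixels(img_high, img_low, mask):
    """Move the masked segment from the high-value image into the low-value image.

    mask holds values in [0, 1] and has the same shape as the images.
    Assumption: the masked segment is replaced by zeros in the first image.
    """
    img_high_removed = (1.0 - mask) * img_high
    img_low_injected = (1.0 - mask) * img_low + mask * img_high
    return img_high_removed, img_low_injected

def contrastive_mask_objective(critic, img_high, img_low, mask):
    """Small when the mask lowers the critic value of the first image
    and raises the critic value of the second image."""
    removed, injected = transfer_masked_pixels(img_high, img_low, mask)
    return critic(removed) - critic(injected)

def brightness_critic(img):
    """Hypothetical stand-in critic: mean pixel intensity."""
    return float(np.mean(img))

# Toy pair: the "rewarding object" is a bright 2x2 patch in the first image.
img_high = np.zeros((4, 4))
img_high[1:3, 1:3] = 1.0
img_low = np.zeros((4, 4))

good_mask = np.zeros((4, 4))
good_mask[1:3, 1:3] = 1.0           # covers the object
empty_mask = np.zeros((4, 4))       # covers nothing

good_score = contrastive_mask_objective(brightness_critic, img_high, img_low, good_mask)
empty_score = contrastive_mask_objective(brightness_critic, img_high, img_low, empty_mask)
print(good_score, empty_score)
```

With this toy critic, the mask that covers the bright patch yields a lower objective than the empty mask, which is the direction of improvement the contrastive training is meant to reward.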