Abstract: We present an approach for building an active agent that learns to segment its visual observations into individual objects by interacting with its environment in a completely self-supervised manner. The agent uses its current segmentation model to infer pixels that constitute objects and refines the segmentation model by interacting with these pixels. The model learned from over 50K interactions generalizes to novel objects and backgrounds. Data collection by interaction is a natural but noisy source of supervision. We propose a robust set loss to deal with this noisy training signal, and we provide a benchmark dataset comprising robot interactions along with a few human-labeled examples for future research to build upon. We provide evidence that re-organization of visual observations into objects is a powerful representation for downstream vision-based control tasks: our system is capable of rearranging multiple objects into target configurations from visual inputs alone. Full paper available at https://pathak22.github.io
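To make the "robust set loss" idea concrete: a minimal sketch, assuming the loss tolerates disagreement with the noisy interaction-derived mask up to a slack margin rather than forcing an exact pixelwise match. The threshold `min_iou` and this particular formulation are illustrative assumptions, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def robust_set_loss(logits, noisy_mask, min_iou=0.7):
    """Hypothetical robust loss for noisy object masks.

    logits: (H, W) raw segmentation scores for one object.
    noisy_mask: (H, W) binary mask in {0, 1} derived from interaction.
    """
    probs = torch.sigmoid(logits)
    # Soft intersection-over-union between prediction and the noisy mask.
    inter = (probs * noisy_mask).sum()
    union = probs.sum() + noisy_mask.sum() - inter
    iou = inter / union.clamp(min=1e-6)
    if iou >= min_iou:
        # Prediction already agrees with the noisy label within the slack
        # margin: no penalty (zero loss, graph kept alive for autograd).
        return 0.0 * logits.sum()
    # Otherwise fall back to standard per-pixel cross-entropy.
    return F.binary_cross_entropy_with_logits(logits, noisy_mask)

# Usage on dummy data:
logits = torch.randn(64, 64, requires_grad=True)
target = (torch.rand(64, 64) > 0.5).float()
loss = robust_set_loss(logits, target)
loss.backward()
```

The key design choice in this sketch is that a prediction close enough to the noisy label incurs zero gradient, so imprecise interaction-derived masks do not drag an already-good prediction toward their errors.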