Unsupervised video decomposition on Sketchy
-------------------------------------------

In these experiments, we train and qualitatively evaluate an unsupervised SAVi model with RGB reconstruction on the Sketchy dataset [1,2]. The Sketchy dataset contains videos of a real-world robotic grasper interacting with various objects.

We use the human demonstration sequences from the "rgb30__all" subset of the Sketchy dataset, which yields a total of 2930 training videos of 201 frames each. In the supplementary material, we include qualitative results on unseen validation set videos.

Our qualitative results (see the supplementary material files listed below) demonstrate that SAVi can decompose these real-world scenes into meaningful object components, and that it can consistently represent and track individual scene components over long time horizons, far beyond the sequence lengths observed during training. SAVi is trained for 1M steps with a per-slot MLP predictor (a single hidden layer of 256 units, a skip connection, and Layer Normalization); the architecture is otherwise identical to the unsupervised SAVi model used for CATER, described in the main paper.
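For concreteness, the per-slot MLP predictor described above can be sketched as follows. This is a minimal numpy illustration, not the released implementation: the slot dimension, number of slots, weight initialization, and the exact placement of Layer Normalization (here applied before the hidden layer) are assumptions; only the hidden width of 256 units, the skip connection, and the use of Layer Normalization are stated in the text.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each slot vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp_predictor(slots, w1, b1, w2, b2):
    """Per-slot MLP predictor: one hidden layer (ReLU), skip connection.

    slots: (num_slots, slot_dim); the same MLP is applied to every slot
    independently, so slots do not interact in the predictor.
    """
    h = np.maximum(layer_norm(slots) @ w1 + b1, 0.0)  # hidden layer, 256 units
    return slots + h @ w2 + b2                        # skip connection

# Illustrative sizes (hypothetical): 11 slots of dimension 128.
rng = np.random.default_rng(0)
slots = rng.normal(size=(11, 128))
w1 = rng.normal(size=(128, 256)) * 0.02
b1 = np.zeros(256)
w2 = rng.normal(size=(256, 128)) * 0.02
b2 = np.zeros(128)
pred = mlp_predictor(slots, w1, b1, w2, b2)  # shape (11, 128)
```

Because the predictor is applied per slot with shared weights, it is permutation-equivariant over slots, matching the slot-based design of SAVi.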

We compare SAVi to a re-implementation of the SIMONe [3] model. We find that the SIMONe baseline has a stronger (yet unintended) bias to segment by color: for example, it splits the green cube into two separate segments representing different shades of green, while merging the darkest components in the scene (the gripper arm, a black patch on the stem of the gripper, and the blue cube) into a single object slot. Because the model segments in part by color, object slots remain largely consistent across the video, even though SIMONe is trained and evaluated on consecutive, independent sub-clips of 16 frames each, without carrying any latent state or history into the following sub-clip. Small inconsistencies can be observed at the boundaries between sub-clips.
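The sub-clip evaluation protocol described above can be sketched as follows. This is a hedged illustration of splitting a video into consecutive, independent 16-frame sub-clips; the frame resolution and the handling of a trailing remainder (dropped here) are assumptions not stated in the text.

```python
import numpy as np

def split_into_subclips(video, clip_len=16):
    """Split a (T, H, W, C) video into consecutive, non-overlapping
    sub-clips of clip_len frames each. Each sub-clip is processed
    independently, with no latent state carried between sub-clips.
    Any trailing remainder shorter than clip_len is dropped."""
    num_clips = video.shape[0] // clip_len
    return video[: num_clips * clip_len].reshape(
        (num_clips, clip_len) + video.shape[1:])

# A 200-frame video (hypothetical 64x64 RGB) yields 12 sub-clips of
# 16 frames; the last 8 frames are dropped.
video = np.zeros((200, 64, 64, 3))
clips = split_into_subclips(video)  # shape (12, 16, 64, 64, 3)
```

Boundary inconsistencies arise precisely because the model sees each of these sub-clips in isolation.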

Supplementary material files:
* README.txt -- This file.
* sketchy_savi_video_1.mp4 & sketchy_savi_video_2.mp4 -- Videos of 200 frames each showing predictions of an unsupervised SAVi model.
* sketchy_simone_video_1.mp4 & sketchy_simone_video_2.mp4 -- Videos of 200 frames each showing predictions of an unsupervised SIMONe baseline model.
* sketchy_savi.pdf -- Visualization of per-slot reconstructions for SAVi.
* sketchy_simone.pdf -- Visualization of per-slot reconstructions for SIMONe.

For easier interpretability, we show only the segmentation masks corresponding to discovered foreground objects and parts in color, whereas all segmentation masks corresponding to discovered background slots are drawn in black (for both SAVi and SIMONe).
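This visualization convention amounts to the following rendering rule, shown here as a small numpy sketch; the function name, palette, and one-hot mask format are illustrative assumptions.

```python
import numpy as np

def colorize_masks(masks, foreground_ids, palette):
    """Render per-slot segmentation masks: slots listed in
    foreground_ids get a color from the palette; all remaining
    (background) slots stay black.

    masks: (num_slots, H, W) one-hot segmentation masks.
    """
    h, w = masks.shape[1:]
    image = np.zeros((h, w, 3), dtype=np.uint8)  # background stays black
    for i, slot_id in enumerate(foreground_ids):
        image[masks[slot_id] > 0] = palette[i % len(palette)]
    return image

# Toy example: slot 0 is background (rows 0-1), slot 1 is a
# foreground object (rows 2-3) drawn in red.
masks = np.zeros((3, 4, 4))
masks[0, :2] = 1
masks[1, 2:] = 1
img = colorize_masks(masks, foreground_ids=[1], palette=[(255, 0, 0)])
```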

[1] Cabi et al., Scaling data-driven robotics with reward sketching and batch reinforcement learning (2019)
[2] https://github.com/deepmind/deepmind-research/tree/master/sketchy
[3] Kabra et al., SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition (2021)
