Abstract: Although Slot Attention (SA) models are widely adopted for object-centric representation learning, they typically assume a shared initialization distribution from which all slots are randomly sampled. This assumption limits their capacity to learn specialized slots that are consistently associated with particular object categories and remain robust to identity-preserving variations in object appearance. To address this limitation, we introduce Probabilistic Superpixel Coding (PSC), an object-centric representation learning method that replaces random slot initialization with a lookup-free identity codebook initialization. Given a set of object proposals, PSC factorizes each object representation into two components: (i) an identity token and (ii) a state vector that captures instance-specific variation. We evaluate PSC on object identity-stability measures, out-of-distribution grounding, and downstream compositional reasoning tasks. The results show that PSC learns more stable and reusable object representations than slot-based baselines.
Beyond Pdf: zip
Submission Type: Beyond PDF submission (pageless, webpage-style content)
Previous TMLR Submission Url: https://openreview.net/forum?id=Tt7kSReDbK&noteId=Tt7kSReDbK
Changes Since Last Submission: Since the previous TMLR submission, we made several substantial revisions to clarify and strengthen the method. First, we replaced the learned codebook with a lookup-free quantization mechanism, where object identity is represented by a binary LFQ code $b^{(i)}\in\{-1,+1\}^{B}$ rather than by an explicit codebook lookup. Second, we replaced the previous orthogonality constraint with a contrastive identity--state separation loss to more directly discourage leakage between the identity representation $z_{\mathrm{id}}^{(i)}$ and the state representation $s^{(t,i)}$. Third, we introduced a memory bank to associate object proposals across time, which allows the model to handle occlusion, disappearance, and late object entry without relying on a first-frame anchor. Fourth, we revised the state encoder by removing graph-based grouping and conditioning it directly on the segmentation mask, i.e., $q_{\phi_s}(s^{(t,i)}\mid x_t,m_{t,i})$, making the method easier to understand and better aligned with the proposal pipeline. Finally, we added experiments focused on the paper's main claim, namely stable object identity learning, and revised the manuscript to address the reviewers' comments.
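The binary identity code $b^{(i)}\in\{-1,+1\}^{B}$ mentioned above can be illustrated with a minimal sketch of generic sign-based lookup-free quantization. This is not the authors' implementation: the function names `lfq_quantize` and `lfq_index`, the choice $B=4$, and the convention of mapping zero to $+1$ are assumptions made for illustration, and the straight-through gradient that training would typically need is omitted.

```python
import numpy as np

def lfq_quantize(z):
    """Map a real-valued latent of shape (..., B) to a code in {-1, +1}^B by taking signs."""
    z = np.asarray(z, dtype=float)
    # Assumption: exact zeros map to +1 so the code never contains 0.
    return np.where(z >= 0, 1, -1)

def lfq_index(b):
    """Integer identity implied by a binary code; no codebook table is stored or searched."""
    bits = (np.asarray(b) > 0).astype(np.int64)          # {-1, +1} -> {0, 1}
    weights = 2 ** np.arange(bits.shape[-1], dtype=np.int64)  # least significant bit first
    return (bits * weights).sum(axis=-1)

# Illustrative latent with B = 4 dimensions (hypothetical values).
z = np.array([0.7, -1.2, 0.0, -0.1])
b = lfq_quantize(z)
idx = lfq_index(b)
print(b, idx)
```

Because the code is read directly off the signs of the latent, the $2^{B}$ possible identities are implicit and no nearest-neighbor search over a learned codebook is required.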
Assigned Action Editor: ~Mengmi_Zhang1
Submission Number: 7836