\section{Introduction}
\label{sec:introduction}

A key step in aligning AI systems with human visual perception is equipping them with a human-like notion of objectness \cite{lake2017building}. Humans do not perceive the world as a stream of pixels, but rather as a collection of object entities that maintain their identity despite changes in viewpoint, lighting, scale, orientation, and articulation \cite{rock1973orientation, kulkarni2015deep, hinton1979some, behrens2018cognitive}. Despite recent progress, several major challenges remain in object-centric representation learning. One such challenge is \emph{identity binding}: a model should associate multiple observations of the same object with more permanent, canonical characteristics of that object, rather than with transient appearance cues alone. Treisman (1999) argued that solving this challenge is a prerequisite for human-like perceptual binding.

Most recent object-centric learning (OCL) methods are slot-based \cite{slotattention, kipf2021conditional, cosa, slate, savi++, dinosaur}. These methods have begun to show impressive results and the potential to scale to complex visual scenes. Yet, despite this empirical progress, they primarily optimize object discovery, reconstruction, or short-horizon temporal propagation, and do not explicitly model a canonical notion of object identity.

Slot Attention represents a scene as an unordered set of object slots. Each slot is initialized from the same distribution and updated with the same refinement network, so the slot index itself does not encode object type or object identity \cite{slotattention,cosa,kori2024identifiable}. This design preserves permutation symmetry: permuting the slots should not change the reconstructed scene \cite{slotattention,kori2024identifiable}. Video extensions such as SAVi, SAVi++, and depth-aware slot models improve temporal object discovery, but their standard objectives still optimize scene reconstruction rather than persistent identity assignment \cite{kipf2021conditional,cosa,kori2024identifiable}. Therefore, the model is not explicitly penalized when two slots swap identities across frames, as long as the reconstructed pixels or features remain correct \cite{slotattention,cosa,kori2024identifiable}.
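
As an illustrative formalization, with notation introduced here only for exposition, let $s_1, \dots, s_K$ denote the slots of a frame and let $D$ denote a decoder that treats them as a set. Permutation symmetry can then be written as
\[
  D(s_{\pi(1)}, \dots, s_{\pi(K)}) = D(s_1, \dots, s_K)
  \quad \mbox{for every permutation } \pi \mbox{ of } \{1, \dots, K\}.
\]
Under this assumption, a reconstruction loss computed on the decoder output takes the same value for any reordering of the slots, so it cannot by itself distinguish one slot-to-object assignment from another.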

This property makes slots effective for object discovery, but it creates ambiguity for identity learning. A slot can represent the orange object in one frame and a different object in another frame, because the objective rewards accurate decomposition rather than a stable mapping between object identity and slot index \cite{slotattention,cosa,kori2024identifiable}. Similarly, a large object can be split across multiple slots if that decomposition improves reconstruction \cite{slotattention,kori2024identifiable}. Thus, standard slot-based models learn object-centric scene decompositions, but they do not guarantee identity-grounded representations: the same object identity is not guaranteed to keep the same slot or the same canonical representation across time or scenes \cite{cosa,kori2024identifiable,kipf2021conditional}.

This instability is closely related to the \emph{Binding Problem}, that is, how internal representations are connected to objects in the real world and to their meaning \cite{harnad1990symbol, greff2020binding}. This problem has two closely related components: segregating the object from the surrounding scene and learning a representation that remains tied to that object despite changes in pose, context, and appearance.\footnote{Neural networks often rely on surface-level statistical regularities rather than underlying concepts, which can hinder systematic generalization.} Recent advances in class-agnostic segmentation and tracking suggest that the segregation component is now handled increasingly well by modern segmentation models.

We therefore focus on the representation side of the Binding Problem. We assume that object proposals are given and ask a different question: how should a model represent an object so that the same object is mapped to the same reusable identity despite frame-specific changes in appearance?

To address this challenge, we propose \modelname\ (\modelcode), a grounded object-centric representation method that factors each object into two complementary components: a reusable identity code and a frame-specific state vector. The identity component is obtained by aggregating evidence across time and mapping it to a discrete identity token, encouraging the same object to be represented by the same canonical code across frames and viewpoints. The state component captures transient attributes such as pose, motion, and local appearance. In contrast to standard slot-based objectives, which make identity only an implicit by-product of successful grouping or reconstruction, our formulation makes temporal identity consistency an explicit modeling target. 

\paragraph{Contributions.}
Our core idea is to replace implicit, exchangeable slot binding with an explicit identity representation that persists across observations. We argue that binding temporary object states to their corresponding permanent object identities can be understood as learning a vocabulary of grounded, canonical object representations. Vector quantization provides a natural mechanism for learning this vocabulary: it maps continuous object representations to a finite set of shared discrete identity codes, encouraging objects with the same underlying identity to reuse the same canonical representation across frames. This reduces the ambiguity of slot permutation and supports more stable cross-frame object binding. To this end, we combine a shared discrete identity vocabulary, cross-frame correspondence, and a factorized identity/state representation to learn object representations that are grounded and temporally stable.
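
For concreteness, vector quantization in its standard nearest-neighbor form can be sketched as follows. The symbols are illustrative notation and are not meant as the exact formulation used by \modelname: $z$ denotes a continuous object representation, $\{e_1, \dots, e_K\}$ a shared codebook of $K$ identity codes, and $q(z)$ the quantized output.
\[
  k^\star = \arg\min_{k \in \{1, \dots, K\}} \| z - e_k \|_2,
  \qquad
  q(z) = e_{k^\star}.
\]
Under this sketch, two observations of the same object whose representations fall closest to the same codebook entry receive the same index $k^\star$, regardless of the frame in which they were observed.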

Our main contributions are as follows:

\begin{itemize}
  \item We introduce a grounded identity-tokenization method for object-centric learning. Instead of relying on the standard slot-based objective to induce stable identity implicitly, the model maps temporally aggregated object evidence to a reusable discrete identity code, yielding a canonical representation that can persist across frames and appearance changes.
  \item We make temporal identity consistency an explicit modeling objective. The method matches grounded object observations across frames and aggregates identity evidence over time, rather than depending on exchangeable slots to preserve identity implicitly through reconstruction or temporal propagation alone.
  \item We factorize each object representation into time-invariant identity and time-varying state. This separation allows the method to preserve stable object-level information while independently modeling transient properties such as pose, motion, and local appearance.
\end{itemize}