\section{Method}
\label{sec:method}

\subsection{Overview}
We propose \modelname{} (\modelcode), an object-centric method for learning stable object identities. Standard Slot Attention represents a scene using exchangeable slots, allowing any slot to bind to any object. While this supports object discovery, it does not ensure that the same slot represents the same object across observations. A reconstruction loss only requires accurate input reconstruction, not consistent identity assignment. As a result, slots may switch between objects or encode short-term cues such as position, color, texture, or pose.

\modelcode{} addresses this limitation by explicitly separating object identity from observation-specific state. Given observations $\mathcal{X}=\{x_t\}_{t=1}^{T}$, each object is factorized into a persistent discrete identity token, obtained by lookup-free quantization, and a continuous state vector. The identity token captures stable object information, while the state vector captures changing attributes such as pose, position, scale, and local appearance. Aggregated object evidence is quantized to a reusable discrete identity code, creating an identity bottleneck that encourages a stable identity representation. The decoder reconstructs the input from the combined identity--state representation, as shown in Fig.~\ref{fig:grounded-identity-state}.

\input{pipeline}

\subsection{Object Proposals}
For each frame $x_t \in [0,1]^{3\times H\times W}$, where $H$ and $W$ denote height and width, we obtain $K_t$ object masks $\{m_{t,k}\}_{k=1}^{K_t}$ from an external class-agnostic segmentation model \cite{hqes,sam1,sam2,sam3}. Here, $K_t$ is the number of object proposals in frame $t$, and $m_{t,k}\in\{0,1\}^{H\times W}$ is the binary mask for object $k$. Each mask is converted into an RGB-A object-focused input $x_{t,k}$ using AlphaCLIP \cite{sun2024alpha}, where the $\alpha$ channel highlights the target object while preserving scene context. These object proposals provide the inputs to the identity encoder.

To maintain object identity across time, we use a memory bank that associates frame-level object proposals belonging to the same physical object. Specifically, each memory slot $M_i$ stores the temporally grouped proposals of one persistent object identity. The memory bank therefore converts frame-level proposals into object-level video tubes:
\begin{equation}
\mathcal{V}^{(i)} = \{x_{t,k} : x_{t,k} \rightarrow M_i\},
\end{equation}
where $\mathcal{V}^{(i)}$ denotes the sequence of RGB-A object-focused inputs associated with object identity $i$.

\subsection{Identity Encoder}

The identity encoder produces a \emph{time-invariant} representation for each persistent object. After the memory bank groups frame-level object proposals into object-level video tubes, each tube
\(\mathcal{V}^{(i)}\) contains the RGB-A object-focused inputs associated with object identity \(i\). The identity encoder processes all proposals in this tube and aggregates them into a single stable identity embedding.

For each object-focused input \(x_{t,k}\in \mathcal{V}^{(i)}\), we extract a continuous identity feature using the Perception Encoder (PE) \cite{bolya2025perception}:
\begin{equation}
z_{t,k} = q_{\phi_z}(x_{t,k}) \in \mathbb{R}^{D}.
\end{equation}
Here, \(z_{t,k}\) captures the appearance and shape evidence of the object proposal at frame \(t\). Since individual frames may contain pose changes, partial occlusion, or viewpoint variation, we average all features assigned to the same memory slot:
\begin{equation}
\bar{z}^{(i)}
=
\frac{1}{|\mathcal{V}^{(i)}|}
\sum_{x_{t,k}\in \mathcal{V}^{(i)}}
z_{t,k}.
\end{equation}
The aggregated feature \(\bar{z}^{(i)}\) serves as the continuous identity representation of object \(i\). Because the average is taken only over the frames in which the object is observed, it is well defined under late object entry and temporary occlusion, and it reduces the influence of frame-specific appearance changes on the identity code.
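
As an illustration, consider a hypothetical tube spanning three frames in which the per-frame proposal index of the object changes, since the segmentation model may return proposals in arbitrary order. The assignments below are assumed for exposition only:
\[
\mathcal{V}^{(1)}=\{x_{1,2},\,x_{2,1},\,x_{3,1}\}
\quad\Longrightarrow\quad
\bar{z}^{(1)}=\tfrac{1}{3}\left(z_{1,2}+z_{2,1}+z_{3,1}\right).
\]
If the object were not observed in frame $2$, the tube would contain two proposals and the average would be taken over those two features, so missing frames need not be imputed.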

We then map the aggregated identity feature to a compact discrete identity code using lookup-free quantization (LFQ), similar to \cite{yu2023language}:
\begin{equation}
a^{(i)}
=
g_{\phi_z}\!\left(\bar z^{(i)}\right)
\in \mathbb{R}^{B},
\qquad
B=\log_2 K_{\mathrm{id}} .
\end{equation}

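Here, $K_{\mathrm{id}}$ is the size of the identity vocabulary, so each identity is described by $B$ bits. The binary LFQ code is obtained by thresholding each dimension: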
\begin{equation}
b^{(i)}_r
=
2\,\mathbf{1}_{\{a^{(i)}_r \geq 0\}} - 1,
\qquad r=1,\dots,B .
\end{equation}

The corresponding integer identity token is:
\begin{equation}
\operatorname{Index}\!\left(b^{(i)}\right)
=
\sum_{r=1}^{B}
2^{r-1}
\mathbf{1}_{\{b^{(i)}_r > 0\}} .
\end{equation}
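
For concreteness, the following worked example uses hypothetical values $B=3$ (so $K_{\mathrm{id}}=2^{3}=8$) and $a^{(i)}=(0.7,\,-1.2,\,0.1)$, chosen purely for illustration. Thresholding at zero and applying the index map gives
\[
b^{(i)}=(+1,\,-1,\,+1),
\qquad
\operatorname{Index}\!\left(b^{(i)}\right)
= 2^{0}\cdot 1 + 2^{1}\cdot 0 + 2^{2}\cdot 1 = 5 .
\]
The first dimension therefore acts as the least significant bit, and the $2^{B}$ sign patterns enumerate the identity vocabulary without a learned codebook lookup.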

During training, we use a straight-through estimator to pass gradients through the binary quantizer, where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operation:
\begin{equation}
\tilde b^{(i)}
=
a^{(i)}
+
\operatorname{sg}\!\left(
b^{(i)} - a^{(i)}
\right).
\end{equation}
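
To make the effect of this estimator explicit, we note a standard property of the stop-gradient construction. In the forward pass the two occurrences of $a^{(i)}$ cancel, while in the backward pass the stop-gradient term is treated as a constant:
\[
\tilde b^{(i)} = b^{(i)} \ \ \text{(forward)},
\qquad
\frac{\partial \tilde b^{(i)}_r}{\partial a^{(i)}_{r'}} = \delta_{rr'} \ \ \text{(backward)},
\]
where $\delta_{rr'}$ is the Kronecker delta. Consequently, for any downstream loss $\mathcal{L}$ (a generic symbol used only in this remark), $\partial\mathcal{L}/\partial a^{(i)} = \partial\mathcal{L}/\partial \tilde b^{(i)}$: the decoder sees the binary code, and the logits receive the gradient as if quantization were the identity map.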

Finally, the quantized identity code is projected into the decoder space:
\begin{equation}
z_{\mathrm{id}}^{(i)}
=
W_{\mathrm{id}} \tilde b^{(i)} + c_{\mathrm{id}} .
\end{equation}

The resulting \(z_{\mathrm{id}}^{(i)}\) is shared across all frames of object \(i\) and is used as the persistent identity anchor for both state estimation and reconstruction.

\subsection{State Encoder}

The state encoder models the \emph{time-varying} configuration of each object, such as pose, position, scale, deformation, occlusion, and motion-related appearance changes. 
For each frame $t$, the segmentation model produces a set of object masks
\[
\mathcal{M}_t=\{m_{t,k}\}_{k=1}^{K_t}, 
\qquad 
m_{t,k}\in\{0,1\}^{H\times W}.
\]
The memory bank assigns each frame-level mask $m_{t,k}$ to a persistent memory slot $M_i$. 
After assignment, we denote by $m_{t,i}$ the mask associated with memory slot $i$ in frame $t$. 
The state encoder then produces one frame-specific state vector $s^{(t,i)}$ for each visible memory slot.

Unlike the identity encoder, which aggregates evidence across time, the state encoder is conditioned only on the current frame and the current mask. 
The mask defines the spatial support of the object but is not treated as an identity representation. 
Given frame features
\begin{equation}
H_t = F_{\psi}(x_t),
\end{equation}
we extract an object-level state feature using mask-conditioned pooling:
\begin{equation}
h_{t,i}
=
\mathrm{MaskPool}(H_t,m_{t,i}).
\end{equation}
We also encode mask geometry as
\begin{equation}
g_{t,i}=\gamma(m_{t,i}),
\end{equation}
where $\gamma(\cdot)$ captures spatial properties such as area, centroid, shape, and extent.
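
The operators $\mathrm{MaskPool}$ and $\gamma$ are left abstract above. One common instantiation, given here as an assumption rather than as a specification of the exact form used, is masked average pooling with the mask resized to the spatial resolution of $H_t$:
\[
h_{t,i}
=
\frac{\sum_{p} m_{t,i}(p)\, H_t(p)}
     {\sum_{p} m_{t,i}(p) + \varepsilon},
\]
where $p$ ranges over spatial locations and $\varepsilon>0$ is a small constant (hypothetical) that avoids division by zero for empty masks. Under the same assumption, a minimal geometry descriptor could consist of the normalized area $\frac{1}{HW}\sum_{p} m_{t,i}(p)$ and the centroid $\sum_{p} p\, m_{t,i}(p) \big/ \sum_{p} m_{t,i}(p)$, with $p$ ranging over pixel coordinates.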

The state posterior is parameterized as
\begin{equation}
q_{\phi_s}
\left(
s^{(t,i)} \mid x_t,m_{t,i}
\right)
=
\mathcal{N}
\left(
\mu_{t,i},
\mathrm{diag}(\sigma_{t,i}^{2})
\right),
\end{equation}
where
\begin{equation}
(\mu_{t,i},\log\sigma_{t,i}^{2})
=
E_{\mathrm{st}}(h_{t,i},g_{t,i}).
\end{equation}
The state latent is sampled using the reparameterization trick:
\begin{equation}
s^{(t,i)}
=
\mu_{t,i}
+
\sigma_{t,i}\odot\epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I).
\end{equation}

Thus, the state encoder produces a separate time-varying state vector for each object mask assigned to a memory slot, while persistent identity is handled separately by the memory-associated LFQ identity code.

\subsection{Decoder}

For each object $k$ visible in frame $t$, where $k$ indexes the memory slot to which the object is assigned (denoted $i$ in the encoder sections), the spatial broadcast decoder \cite{watters2019spatial,greff2019multi} receives two factorized latent variables:
a projected LFQ identity representation $z_{\mathrm{id}}^{(k)}$ and a frame-specific continuous state latent $s^{(t,k)}$.

As in the identity encoder, the LFQ identity code is first projected into the decoder space:
\begin{equation}
z_{\mathrm{id}}^{(k)}
=
W_{\mathrm{id}}\tilde{b}^{(k)} + c_{\mathrm{id}},
\end{equation}
where $\tilde{b}^{(k)}$ is the straight-through version of the lookup-free binary
identity code.

The projected LFQ identity representation and the state latent are concatenated and passed to a shared decoder:
\begin{equation}
u^{(t,k)}
=
\left[
z_{\mathrm{id}}^{(k)} ;
s^{(t,k)}
\right].
\end{equation}

The shared decoder reconstructs the object appearance from the joint identity--state representation:
\begin{equation}
(\hat{v}^{(t,k)}, \hat{\alpha}^{(t,k)})
=
D_{\theta}
\left(
u^{(t,k)}
\right),
\end{equation}
where $\hat{v}^{(t,k)}$ is the RGB reconstruction of object $k$ in frame $t$,
and $\hat{\alpha}^{(t,k)}$ is the corresponding alpha mask.

The object reconstructions are composed into the final frame by normalizing the
alpha masks across objects:
\begin{equation}
\hat{x}_t
=
\sum_{k}
\mathrm{softmax}_{k}
\left(
\hat{\alpha}^{(t,k)}
\right)
\hat{v}^{(t,k)} .
\end{equation}
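
As a worked example with hypothetical values, suppose two objects overlap at a pixel with alpha logits $\hat{\alpha}^{(t,1)}=2$ and $\hat{\alpha}^{(t,2)}=0$. The softmax weights at that pixel are
\[
w_1=\frac{e^{2}}{e^{2}+e^{0}}\approx 0.88,
\qquad
w_2=\frac{e^{0}}{e^{2}+e^{0}}\approx 0.12,
\]
so the composed pixel is approximately $0.88\,\hat{v}^{(t,1)}+0.12\,\hat{v}^{(t,2)}$. Because the weights are nonnegative and sum to one at every pixel, the composition is a convex combination of the per-object reconstructions.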

Reconstruction is thus conditioned jointly on two latent factors: $z_{\mathrm{id}}^{(k)}$, the projected time-invariant LFQ identity code,
and $s^{(t,k)}$, the time-varying object state, which covers pose, position, motion, deformation, occlusion, and local appearance. The two variables
are produced by separate encoders and are combined only at decoding time.

\subsection{Objective}

We optimize an ELBO-style objective with lookup-free quantization for identity and a contrastive separation term between identity and state. We do not use
a separate temporal identity-consistency loss: each memory slot stores a persistent identity code, and the decoder pairs this identity code with the state code from
the same memory slot, so temporal identity consistency is enforced architecturally through the memory bank.

The full training objective is
\begin{equation}
    \label{eq:psc_lfq_objective_revised}
    \begin{aligned}
        \mathcal{L}_{\mathrm{PSC\text{-}LFQ}}
        ={}&
        -\mathbb{E}_{\mathbf{s} \sim q_{\phi_s}(\mathbf{s} \mid \mathbf{x}, \mathbf{o})}
        \Big[
            \log p_{\theta}
            \big(
                \mathbf{x}
                \mid
                \mathbf{s},
                \mathbf{z}_{\mathrm{id}},
                \mathbf{o}
            \big)
        \Big]
        \\
        &+
        \lambda_s
        D_{\mathrm{KL}}
        \big(
            q_{\phi_s}(\mathbf{s} \mid \mathbf{x}, \mathbf{o})
            \,\|\,
            p(\mathbf{s})
        \big)
        \\
        &+
        \beta \mathcal{L}_{\mathrm{commit}}
        +
        \lambda_{\mathrm{entropy}}
        \mathcal{L}_{\mathrm{entropy}}
        \\
        &+
        \lambda_{\mathrm{con}}
        \mathcal{L}_{\mathrm{con}}^{s,\mathrm{id}} .
    \end{aligned}
\end{equation}

Here, $\mathbf{x}$ denotes the input video, $\mathbf{o}$ denotes the set of object proposals, $\mathbf{s}$ denotes the continuous time-varying state latents, and $\mathbf{z}_{\mathrm{id}}$ denotes the LFQ identity representations stored in the memory bank. The state posterior is conditioned on the frame and the object proposals,
\begin{equation}
q_{\phi_s}
\left(
\mathbf{s}
\mid
\mathbf{x}, \mathbf{o}
\right),
\end{equation}
rather than on identity codes. The decoder likelihood reconstructs the video
from memory-paired identity and state representations:
\begin{equation}
p_{\theta}
\left(
\mathbf{x}
\mid
\mathbf{s},
\mathbf{z}_{\mathrm{id}},
\mathbf{o}
\right).
\end{equation}
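
The prior $p(\mathbf{s})$ is not specified above. Under the common assumption of a standard normal prior $p(s)=\mathcal{N}(0,I)$ for each state latent, which we state here only as an assumption, the KL term for a single diagonal Gaussian posterior $\mathcal{N}(\mu,\mathrm{diag}(\sigma^{2}))$ of dimension $D_s$ (a symbol introduced for this remark) has the closed form
\[
D_{\mathrm{KL}}
\big(
\mathcal{N}(\mu,\mathrm{diag}(\sigma^{2}))
\,\|\,
\mathcal{N}(0,I)
\big)
=
\frac{1}{2}
\sum_{d=1}^{D_s}
\left(
\mu_d^{2}+\sigma_d^{2}-\log\sigma_d^{2}-1
\right).
\]
Each summand vanishes exactly when $\mu_d=0$ and $\sigma_d^{2}=1$; for instance, a dimension with $\mu_d=1$ and $\sigma_d^{2}=1$ contributes $\tfrac{1}{2}$. Under this assumption, the total KL is the sum of such terms over all visible slot--frame pairs.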

The identity encoder maps temporally aggregated object evidence to LFQ logits
$\mathbf{a}^{(i)} \in \mathbb{R}^{B}$ for memory slot $i$, where
$B=\log_2 K_{\mathrm{id}}$ and $K_{\mathrm{id}}$ is the size of the identity
vocabulary. The LFQ quantizer produces a binary identity code:
\begin{equation}
    \mathbf{b}^{(i)}
    =
    q_{\mathrm{LFQ}}(\mathbf{a}^{(i)}),
    \qquad
    b^{(i)}_r
    =
    2\mathbf{1}\{a^{(i)}_r \ge 0\} - 1 ,
    \quad r=1,\ldots,B .
\end{equation}

During training, gradients are passed through the non-differentiable binary
quantizer using a straight-through estimator:
\begin{equation}
    \widetilde{\mathbf{b}}^{(i)}
    =
    \mathbf{a}^{(i)}
    +
    \mathrm{sg}
    \big(
        \mathbf{b}^{(i)} - \mathbf{a}^{(i)}
    \big),
\end{equation}
where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation. The projected
identity representation stored in the memory bank and used by the decoder is
\begin{equation}
    \mathbf{z}_{\mathrm{id}}^{(i)}
    =
    W_{\mathrm{id}}
    \widetilde{\mathbf{b}}^{(i)}
    +
    \mathbf{c}_{\mathrm{id}} .
\end{equation}

The LFQ commitment loss encourages the continuous logits to commit to their
assigned binary identity codes:
\begin{equation}
    \mathcal{L}_{\mathrm{commit}}
    =
    \frac{1}{N_{\mathrm{mem}}}
    \sum_{i=1}^{N_{\mathrm{mem}}}
    \left\|
        \mathbf{a}^{(i)}
        -
        \mathrm{sg}
        \big[
            \mathbf{b}^{(i)}
        \big]
    \right\|_2^2 ,
\end{equation}
where $N_{\mathrm{mem}}$ is the number of active memory slots in the minibatch.
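
As a worked example with hypothetical values, take a single active slot with $\mathbf{a}=(0.7,\,-1.2,\,0.1)$, so that $\mathbf{b}=(+1,\,-1,\,+1)$. Then
\[
\left\|\mathbf{a}-\mathbf{b}\right\|_2^2
=(-0.3)^2+(-0.2)^2+(-0.9)^2
=0.09+0.04+0.81
=0.94 .
\]
The dimension closest to the decision boundary ($a_3=0.1$) dominates the penalty, so minimizing this term pushes logits away from zero and toward $\pm 1$, which makes the sign pattern less likely to flip under small perturbations of the aggregated feature.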

We also use the LFQ entropy codebook-utilization loss:
\begin{equation}
    \mathcal{L}_{\mathrm{entropy}}
    =
    \mathbb{E}
    \big[
        H(q_{\mathrm{LFQ}}(\mathbf{a}))
    \big]
    -
    H
    \big(
        \mathbb{E}
        [
            q_{\mathrm{LFQ}}(\mathbf{a})
        ]
    \big).
\end{equation}
In the binary LFQ case, this can be written using soft probabilities
$p^{(i)}_r=\sigma(a^{(i)}_r/\tau)$, where $\sigma$ is the logistic sigmoid and $\tau>0$ is a temperature:
\begin{equation}
    \mathcal{L}_{\mathrm{entropy}}
    =
    \frac{1}{N_{\mathrm{mem}}}
    \sum_{i=1}^{N_{\mathrm{mem}}}
    \sum_{r=1}^{B}
    h(p^{(i)}_r)
    -
    \sum_{r=1}^{B}
    h(\bar{p}_r),
    \qquad
    \bar{p}_r
    =
    \frac{1}{N_{\mathrm{mem}}}
    \sum_{i=1}^{N_{\mathrm{mem}}}
    p^{(i)}_r ,
\end{equation}
where $h(p)=-p\log p-(1-p)\log(1-p)$.
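
To see which configurations this loss prefers, consider a hypothetical minibatch with $N_{\mathrm{mem}}=2$ slots and a single bit ($B=1$), using natural logarithms:
\[
\begin{aligned}
(p^{(1)}_1,p^{(2)}_1)=(0.5,\,0.5):&\quad
\mathcal{L}_{\mathrm{entropy}}=\log 2-\log 2=0,\\
(p^{(1)}_1,p^{(2)}_1)\approx(1,\,1):&\quad
\mathcal{L}_{\mathrm{entropy}}\approx 0-0=0,\\
(p^{(1)}_1,p^{(2)}_1)\approx(1,\,0):&\quad
\mathcal{L}_{\mathrm{entropy}}\approx 0-\log 2\approx -0.69 .
\end{aligned}
\]
Only the last configuration attains the lower value: each slot is confident about its bit, and the two slots use different bit values. Minimizing the loss therefore favors confident per-slot codes together with balanced usage of the identity vocabulary across slots.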

Finally, we add a contrastive identity--state separation loss to discourage the state latent from encoding identity-specific information. Since the state
encoder does not receive $z_{\mathrm{id}}^{(i)}$ as input, this loss acts as an additional disentanglement constraint rather than as an assignment mechanism.

We first project state and identity into a shared contrastive space:
\begin{equation}
    \mathbf{u}_{t,i}
    =
    \frac{
        r_s(\mathbf{s}^{(t,i)})
    }{
        \|r_s(\mathbf{s}^{(t,i)})\|_2
    },
    \qquad
    \mathbf{v}_{i}
    =
    \frac{
        r_{\mathrm{id}}(\mathbf{z}_{\mathrm{id}}^{(i)})
    }{
        \|r_{\mathrm{id}}(\mathbf{z}_{\mathrm{id}}^{(i)})\|_2
    } .
\end{equation}

The separation loss is
\begin{equation}
    \mathcal{L}_{\mathrm{con}}^{s,\mathrm{id}}
    =
    \frac{1}{\sum_i |\mathcal{T}_i|}
    \sum_i
    \sum_{t\in\mathcal{T}_i}
    \left[
        \max
        \left(
            0,
            \mathbf{u}_{t,i}^{\top}
            \mathrm{sg}[\mathbf{v}_i]
            -
            m
        \right)
    \right]^2 ,
\end{equation}
where $\mathcal{T}_i$ is the set of frames where memory slot $i$ is visible and $m$ is a scalar similarity margin (not to be confused with the masks $m_{t,i}$). We apply stop-gradient to the identity projection in
this term so that the contrastive loss primarily removes identity information from the state representation, while the identity code itself remains governed
by LFQ, commitment, entropy utilization, and reconstruction.
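
As a worked example with a hypothetical margin $m=0.5$, a state projection whose cosine similarity with the identity projection is $\mathbf{u}_{t,i}^{\top}\mathbf{v}_i=0.8$ incurs
\[
\left[\max(0,\;0.8-0.5)\right]^2=0.09,
\]
whereas a similarity of $0.2$ gives $\left[\max(0,\;0.2-0.5)\right]^2=0$. State vectors are thus penalized only when they align with the identity direction beyond the margin, and because of the stop-gradient on $\mathbf{v}_i$, the gradient of this term flows only through $\mathbf{u}_{t,i}$.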

The scalars $\lambda_s$, $\beta$, $\lambda_{\mathrm{entropy}}$, and $\lambda_{\mathrm{con}}$ control the state KL regularization, LFQ commitment
loss, entropy utilization loss, and identity--state contrastive separation loss, respectively. We minimize Eq.~\eqref{eq:psc_lfq_objective_revised} with respect to the state encoder, identity encoder, decoder, LFQ projection parameters, and contrastive projection heads.
