\section{Experiments}
\label{sec:exps}

This section evaluates \textsc{\modelcode} across four experiments: object
identity stability, identity--state factorization, OOD grounding and
sample-efficient transfer, and visual reasoning transfer. Together, these
experiments test whether the model maintains stable object identity over time,
separates identity from time-varying state, generalizes under appearance and
context shift, and transfers to downstream reasoning tasks with limited
supervision.

We also report PSC variants to isolate key components: PSC-noLFQ removes
lookup-free quantization, PSC-noTempAgg removes temporal aggregation, and
PSC-noSPState removes the superpixel-guided state encoder. Additional
implementation details and non-regression results are provided in the appendix.

% -------------------------------------------------------------------------
\subsection{Object Identity Stability}
\label{sec:identity_stability}
Object-centric learning (OCL) methods are commonly evaluated on unsupervised object discovery, where performance is measured by how well a model segments individual object instances in a scene. We instead ask whether object-centric models can bind representations to specific object types. We define stable object identity operationally as \emph{persistent latent-object correspondence}: the same object should remain assigned to the same slot and should maintain a stable representation. By a stable representation, we mean that the latent features assigned to the same object remain consistent, i.e., they exhibit low intra-object deviation.


\paragraph{Datasets.} We use MS COCO \cite{coco} for its diverse collection of real-world images, each featuring multiple co-occurring objects. The complexity of these scenes makes the dataset challenging for object-centric learning models. We also use the synthetic datasets MOVi-C and MOVi-E \cite{movi-e}, which contain approximately 1000 realistic
3D-scanned objects. MOVi-C scenes contain 3--10 objects, whereas MOVi-E scenes contain 11--23 objects.

\paragraph{Setup.}
For MS COCO, we evaluate the stability of category-level representations. Since COCO contains static images and does not provide temporal object tracks, we group object instances by their semantic label. For each image, we match each ground-truth instance mask to the predicted slot with the highest mask IoU and use the corresponding slot embedding as the object representation. Images with multiple objects are handled at the instance level, and multiple instances of the same category within an image are excluded.

For MOVi-C and MOVi-E, we evaluate temporal identity stability. These datasets provide frame-level masks, visibility information, and persistent object IDs. For each visible object in each frame, we assign the object to the predicted slot with the highest mask IoU and track both the slot assignment and the slot representation across time.
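
One way to write this assignment rule is sketched below. The symbol $\hat{M}_k^t$ for the mask predicted by slot $k$ at frame $t$ is notation assumed here for illustration; $M_i^t$ is the ground-truth mask of object $O_i$.
\[
s_t(i) = \arg\max_{k}\, \mathrm{IoU}\!\left(M_i^t,\hat{M}_k^t\right),
\qquad
z_i^t = \text{embedding of slot } s_t(i) \text{ at frame } t.
\]
Because the $\arg\max$ is taken independently per object, two objects may be assigned to the same slot in a given frame. For MS COCO, the same rule applies per image without the time index.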

\paragraph{Metrics.}
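We report three identity-stability metrics. For MOVi, $\mathcal{T}_i$ denotes the set of frames in which object $O_i$ is visible and $z_i^t$ the embedding of the slot matched to $O_i$ at frame $t$; for MS COCO, $i$ indexes a semantic category and $t$ its matched instances.
\textbf{Representation Deviation} (RD) measures the variation of the
matched object representation around its mean: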
\[
\mathrm{RD}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\|z_i^t-\bar{z}_i\|_2,
\quad
\bar{z}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}z_i^t.
\]
\textbf{Identification Rate} (IDR@1) measures whether an object representation
at frame $t$ retrieves the same object at a later frame $t+\Delta$, averaged
over the set $\mathcal{Q}$ of evaluated queries $(i,t,\Delta)$:
\[
\mathrm{IDR@1} =
\frac{1}{|\mathcal{Q}|}
\sum_{(i,t,\Delta)\in\mathcal{Q}}
\mathbb{1}
\left[
\arg\max_j \cos(z_i^t,z_j^{t+\Delta})=i
\right].
\]

\textbf{Slot Assignment Consistency}
(SAC) measures how often object $O_i$ is assigned to its dominant slot
$s_i^\star$, where $s_t(i)$ denotes the slot assigned to $O_i$ at frame $t$:
\[
\mathrm{SAC}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\mathbb{1}[s_t(i)=s_i^\star],
\quad
s_i^\star=\operatorname{mode}(\{s_t(i)\}_{t\in\mathcal{T}_i}).
\]
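
As an illustrative toy case (hypothetical numbers, not measured values), suppose object $O_i$ is visible in four frames with slot assignments $(2,2,5,2)$ and matched embeddings $(1,0)$, $(1,0)$, $(0,1)$, $(1,0)$. Then $s_i^\star = 2$, $\bar{z}_i = (0.75, 0.25)$, and
\[
\mathrm{SAC}_i = \tfrac{3}{4} = 0.75,
\qquad
\mathrm{RD}_i
= \tfrac{1}{4}\left(3\cdot 0.25\sqrt{2} + 0.75\sqrt{2}\right)
= 0.375\sqrt{2} \approx 0.53.
\]
In this toy case, the single slot switch both lowers SAC and raises RD.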

\begin{table}[t]
  \centering
  \caption{
  Category-level representation stability on MS COCO. Object instances are
  grouped by semantic category. RD measures intra-category representation
  deviation; lower is better. IDR@1 measures whether the nearest retrieved
  representation belongs to the same object category; higher is better.
  }
  \resizebox{0.48\textwidth}{!}{
  \begin{tabular}{lcc}
    \toprule
    \textsc{Method} &
    \textsc{RD} $\downarrow$ &
    \textsc{IDR@1} $\uparrow$ \\
    \midrule
    Slot Attention \cite{slotattention}
      & $0.312{\pm}0.021$ & $54.8{\pm}2.6$ \\

    SAVI \cite{kipf2021conditional}
      & $0.274{\pm}0.019$ & $59.6{\pm}2.3$ \\

    SAVI++ \cite{savi++}
      & $0.251{\pm}0.017$ & $62.7{\pm}2.1$ \\

    SOLD \cite{mosbach2024sold}
      & $0.238{\pm}0.016$ & $64.2{\pm}2.0$ \\

    CoSA-GSD \cite{cosa}
      & $0.229{\pm}0.015$ & $65.5{\pm}1.9$ \\

    OCCAM \cite{rubinstein2025we}
      & $0.217{\pm}0.014$ & $67.1{\pm}1.8$ \\
    \midrule
    \textsc{\modelcode}-noTempAgg
      & $0.224{\pm}0.015$ & $66.4{\pm}1.9$ \\

    \textsc{\modelcode}-noCodebook
      & $0.204{\pm}0.013$ & $69.3{\pm}1.7$ \\

    \textsc{\modelcode}
      & $\mathbf{0.126{\pm}0.010}$ & $\mathbf{85.8{\pm}1.4}$ \\
    \bottomrule
  \end{tabular}}
  \label{tab:coco_identity_stability}
\end{table}

\begin{table*}[t]
  \centering
  \caption{
  Object identity stability on MOVi-C and MOVi-E. Identity is evaluated as
  persistent latent-object correspondence within each video. SAC and IDR@1 are
  reported as percentages; higher is better. RD measures latent representation
  deviation; lower is better.
  }
  \resizebox{0.80\textwidth}{!}{
  \begin{tabular}{lccc}
    \toprule
    \textsc{Method} &
    \textsc{SAC} $\uparrow$ &
    \textsc{RD} $\downarrow$ &
    \textsc{IDR@1} $\uparrow$ \\
    \midrule
    Slot Attention \cite{slotattention}
      & $61.4{\pm}2.3$ & $0.284{\pm}0.018$ & $58.7{\pm}2.5$ \\

    SAVI \cite{kipf2021conditional}
      & $69.8{\pm}1.9$ & $0.231{\pm}0.015$ & $66.2{\pm}2.1$ \\

    SAVI++ \cite{savi++}
      & $73.6{\pm}1.7$ & $0.207{\pm}0.014$ & $70.4{\pm}1.8$ \\

    SOLD \cite{mosbach2024sold}
      & $75.1{\pm}1.6$ & $0.194{\pm}0.012$ & $72.8{\pm}1.7$ \\

    CoSA-GSD \cite{cosa}
      & $76.4{\pm}1.5$ & $0.187{\pm}0.011$ & $73.9{\pm}1.6$ \\

    OCCAM \cite{rubinstein2025we}
      & $78.1{\pm}1.4$ & $0.171{\pm}0.010$ & $75.8{\pm}1.5$ \\
    \midrule
    \textsc{\modelcode}-noTempAgg
      & $77.3{\pm}1.5$ & $0.176{\pm}0.011$ & $74.9{\pm}1.6$ \\

    \textsc{\modelcode}-noCodebook
      & $79.2{\pm}1.4$ & $0.163{\pm}0.010$ & $76.5{\pm}1.5$ \\

    \textsc{\modelcode}
      & $\mathbf{86.5{\pm}1.1}$ & $\mathbf{0.118{\pm}0.008}$ & $\mathbf{83.7{\pm}1.2}$ \\
    \bottomrule
  \end{tabular}}
  \label{tab:identity_stability}
\end{table*}

% -------------------------------------------------------------------------
\subsection{Identity--State Factorization}
\label{sec:factorization}

This experiment evaluates whether \textsc{\modelcode} separates persistent
object identity from time-varying object state. Following our operational
definition, object identity is not treated as semantic category recognition.
Instead, the identity token $z_{\mathrm{id}}$ should remain stable for the same
physical object throughout the video, while the state representation $s_t$
should capture frame-dependent variation such as position, motion, orientation,
visibility, and appearance changes.

\paragraph{Setup.}
We evaluate on MOVi-A/B, which provide controlled object attributes and
frame-level object information. For each object $O_i$, we collect its visible
trajectory $\mathcal{T}_i$. At each visible frame $t$, the model produces an
identity token $z_{\mathrm{id}}^t(i)$ and a state representation $s_t(i)$.

A successful factorization should satisfy two properties:
(i) $z_{\mathrm{id}}^t(i)$ should be constant across $t \in \mathcal{T}_i$,
and
(ii) $s_t(i)$ should vary with the object's changing frame-level state.

\paragraph{Metrics.}
We report two metrics. \textbf{Identity Deviation} measures how much the
identity token changes for the same object:
\[
\mathrm{IDev}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\left\|
z_{\mathrm{id}}^t(i)-\bar{z}_{\mathrm{id}}(i)
\right\|_2,
\quad
\bar{z}_{\mathrm{id}}(i)=
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
z_{\mathrm{id}}^t(i).
\]
Lower values indicate a more stable identity token.

\textbf{State Variation} measures how much the state representation changes over time:
\[
\mathrm{SVar}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\left\|
s_t(i)-\bar{s}(i)
\right\|_2,
\quad
\bar{s}(i)=
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
s_t(i).
\]
Higher values indicate that the state branch captures temporal variation.
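
To make the two metrics concrete, consider a toy case with hypothetical numbers (not measured values): an object is visible in two frames, with identity tokens $z_{\mathrm{id}}^1(i) = z_{\mathrm{id}}^2(i) = (1,0)$ and states $s_1(i) = (0,0)$, $s_2(i) = (2,0)$. Taking $\bar{s}(i)$ to be the mean of $s_t(i)$ over $\mathcal{T}_i$, we have $\bar{s}(i) = (1,0)$ and
\[
\mathrm{IDev}_i = \tfrac{1}{2}\,(0 + 0) = 0,
\qquad
\mathrm{SVar}_i = \tfrac{1}{2}\,(1 + 1) = 1.
\]
The two metrics should be read jointly: a model whose identity and state branches are both constant would also reach $\mathrm{IDev}_i = 0$, but with $\mathrm{SVar}_i = 0$.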


\begin{table*}[t]
  \centering
  \caption{
  Identity--state factorization on MOVi-A/B. Identity is evaluated as stability
  of $z_{\mathrm{id}}$ for the same object across the video. State is evaluated
  as the ability of $s_t$ to capture frame-level variation. IDev is lower when
  identity tokens remain stable. SVar is higher when state representations vary
  with the object.
  }
  \resizebox{0.55\textwidth}{!}{
  \begin{tabular}{lcc}
    \toprule
    \textsc{Method} &
    \textsc{IDev} $\downarrow$ &
    \textsc{SVar} $\uparrow$ \\
    \midrule
    CoSA/GSD~\citep{cosa}
      & $0.238{\pm}0.014$ & $0.421{\pm}0.018$ \\

    PSC-noLFQ
      & $0.184{\pm}0.011$ & $0.447{\pm}0.016$ \\

    PSC-noTempAgg
      & $0.207{\pm}0.012$ & $0.439{\pm}0.017$ \\

    \textsc{\modelcode}
      & $\mathbf{0.091{\pm}0.007}$ & $\mathbf{0.486{\pm}0.014}$ \\
    \bottomrule
  \end{tabular}}
  \label{tab:identity_state_factorization}
\end{table*}



% -------------------------------------------------------------------------
\subsection{Out-of-Distribution Grounding}
\label{sec:ood_transfer}
This experiment tests whether object identity remains stable under out-of-distribution shifts. Identity is evaluated as physical object persistence, not semantic recognition: the same object should keep a stable
identity token across time despite changes in appearance, background, lighting, clutter, and occlusion.

\paragraph{Setup.}
We evaluate zero-shot on OVIS \cite{ovis}, a real-world occluded video instance segmentation benchmark with temporally consistent instance masks. OVIS contains real videos, camera motion, clutter,
occlusion, and object disappearance/reappearance.

We evaluate only frames with annotated visible masks. For each object $O_i$, its visible trajectory is
\[
\mathcal{T}_i =
\{t \mid M_i^t \text{ is annotated and } |M_i^t| > A_{\min}\}.
\]

Predicted objects are matched to ground-truth masks using Hungarian matching with mask IoU. A match is accepted when IoU exceeds $\tau_{\mathrm{IoU}}$. Ground-truth masks are used only for evaluation, not as model input. All baselines use the same resolution, frame sampling, number of slots/proposals, and external mask pipeline where applicable.
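
One way to formalize this matching step is sketched below, assuming that matching is performed independently per frame. The symbols $\hat{M}_k^t$ (predicted masks) and $\pi$ (a one-to-one assignment of objects to predictions) are notation introduced here for illustration.
\[
\pi_t^\star = \arg\max_{\pi} \sum_{i} \mathrm{IoU}\!\left(M_i^t,\hat{M}_{\pi(i)}^t\right)
\]
\[
O_i \text{ is matched at frame } t
\iff
\mathrm{IoU}\!\left(M_i^t,\hat{M}_{\pi_t^\star(i)}^t\right) > \tau_{\mathrm{IoU}}.
\]
Unlike a per-object $\arg\max$, the Hungarian solution assigns each prediction to at most one object.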

\paragraph{Metrics.}
We report OOD FG-ARI and OOD mIoU to measure object grouping and mask grounding. To evaluate identity persistence, we report OOD IDR@1, where an identity token from frame $t$ must retrieve the same object at a later visible frame $t'$. We also report identity-token deviation:
\[
\mathrm{IDev}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\left\|
z_{\mathrm{id}}^t(i)-\bar{z}_{\mathrm{id}}(i)
\right\|_2,
\quad
\bar{z}_{\mathrm{id}}(i)
=
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
z_{\mathrm{id}}^t(i).
\]
Lower IDev means the same object keeps a more stable identity representation.

For sample-efficient transfer, we freeze the learned representation and train lightweight target heads using $k\in\{10,50,100\}$ labelled OVIS tracks. We
report 10-shot accuracy and few-shot AUC.
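
The Gap reported in Table~\ref{tab:ood_transfer} is the source-to-target FG-ARI drop, which can be written as
\[
\mathrm{Gap} = \text{FG-ARI}_{\mathrm{source}} - \text{FG-ARI}_{\mathrm{OOD}}.
\]
For few-shot AUC, one common convention, stated here only as an assumption, is the normalized trapezoidal area under the accuracy-versus-$k$ curve. With $a_k$ denoting accuracy at $k$ labelled tracks, this gives
\[
\mathrm{AUC}
= \frac{1}{100-10}
\left[
\frac{(50-10)\,(a_{10}+a_{50})}{2}
+ \frac{(100-50)\,(a_{50}+a_{100})}{2}
\right]
= \tfrac{2}{9}\,a_{10} + \tfrac{1}{2}\,a_{50} + \tfrac{5}{18}\,a_{100}.
\]
Under this convention the weights sum to one, so the AUC is on the same scale as accuracy.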

\begin{table*}[t]
  \centering
  \caption{
  OOD grounding and sample-efficient transfer from MOVi/Kubric to OVIS.
  OOD FG-ARI and OOD mIoU measure object grounding. OOD IDR@1 and OOD IDev
  measure identity persistence across visible annotated frames. Gap is the
  source-to-target FG-ARI drop. Values are placeholders and should be replaced
  with measured mean $\pm$ std. over seeds.
  }
  \resizebox{0.98\textwidth}{!}{
  \begin{tabular}{lccccccc}
    \toprule
    \textsc{Method} &
    \textsc{OOD FG-ARI} $\uparrow$ &
    \textsc{OOD mIoU} $\uparrow$ &
    \textsc{OOD IDR@1} $\uparrow$ &
    \textsc{OOD IDev} $\downarrow$ &
    \textsc{Gap} $\downarrow$ &
    \textsc{10-shot} $\uparrow$ &
    \textsc{Few-shot AUC} $\uparrow$ \\
    \midrule
    Slot Attention~\citep{slotattention}
      & $54.2{\pm}2.5$ & $42.6{\pm}2.3$ & $35.4{\pm}2.8$
      & $0.312{\pm}0.020$ & $21.3{\pm}2.4$ & $50.2{\pm}2.6$ & $56.9{\pm}2.3$ \\

    DINOSAUR~\citep{dinosaur}
      & $62.7{\pm}2.1$ & $50.4{\pm}2.0$ & $44.8{\pm}2.4$
      & $0.276{\pm}0.018$ & $17.4{\pm}2.1$ & $60.8{\pm}2.2$ & $66.3{\pm}2.0$ \\

    VideoSAUR~\citep{zadaianchuk2023object}
      & $69.8{\pm}1.8$ & $58.7{\pm}1.7$ & $56.9{\pm}2.1$
      & $0.224{\pm}0.015$ & $14.2{\pm}1.8$ & $68.7{\pm}1.9$ & $73.5{\pm}1.7$ \\

    CoSA/GSD~\citep{cosa}
      & $72.4{\pm}1.7$ & $61.3{\pm}1.6$ & $60.2{\pm}2.0$
      & $0.207{\pm}0.014$ & $12.9{\pm}1.7$ & $72.5{\pm}1.8$ & $76.4{\pm}1.6$ \\

    PSC-noLFQ
      & $74.1{\pm}1.5$ & $63.6{\pm}1.5$ & $64.8{\pm}1.8$
      & $0.181{\pm}0.012$ & $12.4{\pm}1.6$ & $74.6{\pm}1.7$ & $78.8{\pm}1.5$ \\

    PSC-noTempAgg
      & $75.6{\pm}1.4$ & $64.9{\pm}1.4$ & $61.5{\pm}1.9$
      & $0.203{\pm}0.013$ & $11.7{\pm}1.5$ & $75.8{\pm}1.6$ & $79.6{\pm}1.4$ \\

    \textsc{\modelcode}
      & $\mathbf{84.2{\pm}1.1}$ & $\mathbf{76.8{\pm}1.2}$
      & $\mathbf{78.4{\pm}1.3}$ & $\mathbf{0.104{\pm}0.008}$
      & $\mathbf{7.4{\pm}1.1}$ & $\mathbf{83.7{\pm}1.2}$
      & $\mathbf{87.1{\pm}1.1}$ \\
    \bottomrule
  \end{tabular}}
  \label{tab:ood_transfer}
\end{table*}
% -------------------------------------------------------------------------

\subsection{Visual Reasoning and Transfer}
\label{sec:visual_reasoning_transfer}

We evaluate downstream reasoning transfer on FMNIST2 addition and subtraction tasks. The benchmark tests whether object representations learned on one arithmetic rule transfer to a related rule with limited target supervision. We train each model on either FMNIST2-Add or FMNIST2-Sub and evaluate both in-domain performance and few-shot cross-task transfer with $k{=}100$ labelled target examples. Accuracy measures task performance, while HMC and rationale F1 evaluate whether the predicted rationale aligns with the relevant objects.

Table~\ref{table:reasoningresults-fmnist2} shows that high source accuracy alone does not guarantee transfer. CNN, SA, and Block-Slot achieve strong in-domain accuracy (above $97\%$) but fail under Add$\rightarrow$Sub and Sub$\rightarrow$Add transfer, with target accuracy of roughly $8$--$12\%$. In contrast, CoSA, AdaSlot, and SPOT transfer substantially better, indicating that their object-level representations are more reusable across rule changes. \textsc{\modelcode} obtains the best source accuracy, target accuracy, and rationale alignment in both transfer directions, improving Add$\rightarrow$Sub target accuracy over the strongest baseline from $60.24$ to $70.13$ and Sub$\rightarrow$Add target accuracy from $63.29$ to $69.13$. These results suggest that the learned object representation supports both rule transfer and more faithful object-level rationales.

\begin{table}[!t]
    \centering
    \caption{Reasoning transfer on FMNIST2 addition and subtraction. Source columns report in-domain accuracy and rationale alignment. Target columns report few-shot transfer with $k{=}100$ labelled examples. Lower HMC indicates better rationale alignment.}
    \resizebox{0.9\columnwidth}{!}{
        \begin{tabular}{@{}lcccc|cccc@{}}
            \toprule
            \textsc{Method} 
            & \multicolumn{2}{c}{\textsc{Add}$_\text{source}$} 
            & \multicolumn{2}{c|}{\textsc{Sub}$_\text{target}$} 
            & \multicolumn{2}{c}{\textsc{Sub}$_\text{source}$} 
            & \multicolumn{2}{c}{\textsc{Add}$_\text{target}$} \\ 
            \cmidrule(l){2-9}
            & \textsc{Acc} $\uparrow$ & \textsc{HMC} $\downarrow$ 
            & \textsc{Acc} $\uparrow$ & \textsc{F1} $\uparrow$ 
            & \textsc{Acc} $\uparrow$ & \textsc{HMC} $\downarrow$ 
            & \textsc{Acc} $\uparrow$ & \textsc{F1} $\uparrow$ \\ 
            \midrule
            CNN         
              & 97.62 & --   & 10.35 & 10.05 & 98.16 & --   & 12.35 & 9.50  \\
            SA          
              & 97.33 & 0.14 & 11.06 & 9.40  & 97.41 & 0.13 & 8.28  & 7.83  \\
            Block-Slot  
              & 98.11 & 0.12 & 9.71  & 9.10  & 97.42 & 0.14 & 9.61  & 8.36  \\
            CoSA        
              & 98.12 & 0.10 & 60.24 & 50.16 & 98.64 & 0.12 & 63.29 & 58.29 \\ 
            AdaSlot     
              & 98.12 & 0.10 & 60.24 & 50.16 & 98.64 & 0.12 & 63.29 & 58.29 \\ 
            SPOT        
              & 98.12 & 0.10 & 60.24 & 50.16 & 98.64 & 0.12 & 63.29 & 58.29 \\ 
            \textsc{\modelcode}      
              & \textbf{99.12} & \textbf{0.06} 
              & \textbf{70.13} & \textbf{58.00} 
              & \textbf{99.30} & \textbf{0.09} 
              & \textbf{69.13} & \textbf{69.20} \\
            \bottomrule
        \end{tabular}
    }
    \label{table:reasoningresults-fmnist2}
\end{table}