\clearpage
\section{Datasets}
\label{appsec:datasets}

This appendix summarizes the datasets used in the experiments. The datasets are
organized according to the four evaluation settings in Sec.~\ref{sec:exps}:
object identity stability, identity--state factorization, out-of-distribution
grounding and sample-efficient transfer, and visual reasoning transfer.

% -------------------------------------------------------------------------
\subsection{Datasets for Object Identity Stability}

\subsubsection{MS COCO}
MS COCO~\cite{coco} is used to evaluate category-level representation stability
on real-world images. Because COCO consists of static images without temporal
object tracks, we do not use it to evaluate physical identity persistence over
time. Instead, object instances are grouped by semantic category: for each
image, ground-truth instance masks are matched to predicted slots using mask
IoU, and the embeddings of the matched slots are used to measure stability
within each category.
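
For reference, mask IoU follows the standard definition. Writing $M_i$ for the
set of pixels in ground-truth mask $i$ and $\hat{M}_k$ for the set of pixels
assigned to slot $k$ (notation introduced here for illustration only),
\[
  \mathrm{IoU}(M_i, \hat{M}_k)
  = \frac{|M_i \cap \hat{M}_k|}{|M_i \cup \hat{M}_k|}.
\]
A simple matching rule, given here as an illustrative assumption rather than as
the exact procedure, assigns instance $i$ to the slot
$k^{\star}(i) = \arg\max_{k} \mathrm{IoU}(M_i, \hat{M}_k)$; a one-to-one
alternative is Hungarian matching on the matrix of pairwise IoU values.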

\subsubsection{MOVi-C}
MOVi-C~\cite{movi-e} is used to evaluate temporal object identity stability in
synthetic multi-object videos. It contains scenes with multiple moving objects
and provides frame-level instance masks, visibility annotations, and persistent
object identifiers. These annotations allow us to evaluate whether the same
physical object remains assigned to a consistent slot and maintains a stable
latent representation across frames.

\subsubsection{MOVi-E}
MOVi-E~\cite{movi-e} is used as a more challenging temporal identity benchmark.
Compared with MOVi-C, MOVi-E contains denser scenes with more objects per video.
The dataset provides persistent object IDs, frame-level masks, and visibility
information, enabling evaluation of slot assignment consistency, representation
deviation, and identity retrieval across time.
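
As an illustration only, slot assignment consistency can be formalized as
follows; the definitions used in the experiments are those in
Appendix~\ref{appsec:metrics}, and the notation below is introduced here for
exposition. Let $\mathcal{T}_o$ denote the set of frames in which object $o$ is
visible, and let $\pi_t(o)$ denote the slot matched to $o$ at frame $t$. Then
\[
  C(o) = \max_{k} \frac{1}{|\mathcal{T}_o|}
         \sum_{t \in \mathcal{T}_o} \mathbf{1}\bigl[\pi_t(o) = k\bigr]
\]
is the fraction of visible frames in which $o$ occupies its most frequent slot,
so $C(o) = 1$ exactly when the object never switches slots.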

% -------------------------------------------------------------------------
\subsection{Datasets for Identity--State Factorization}

\subsubsection{MOVi-A}
MOVi-A is used to evaluate identity--state factorization under controlled
synthetic conditions. It provides object-level information that allows us to
measure whether the identity token remains stable for the same physical object
while the state representation varies with frame-dependent changes.

\subsubsection{MOVi-B}
MOVi-B extends the controlled setting of MOVi-A with additional object and
motion variability. We use MOVi-A/B to test whether
$\mathbf{z}_{\mathrm{id}}$ captures persistent object identity and whether
$s_t$ captures time-varying state factors such as position, motion,
orientation, visibility, and appearance changes.

% -------------------------------------------------------------------------
\subsection{Dataset for Out-of-Distribution Grounding and Transfer}

\subsubsection{OVIS}
OVIS~\cite{ovis} is used for zero-shot out-of-distribution grounding and
sample-efficient transfer. It is a real-world occluded video instance
segmentation benchmark containing camera motion, clutter, object occlusion, and
object disappearance/reappearance. OVIS provides temporally consistent instance
annotations, which allow us to evaluate object grouping, mask grounding,
identity-token stability, and identity retrieval across visible annotated
frames.

For the few-shot transfer setting, we freeze the learned representation and
train lightweight target heads on $k$ labelled OVIS tracks, with
$k \in \{10, 50, 100\}$.

% -------------------------------------------------------------------------
\subsection{Visual Reasoning and Transfer Benchmark}

\subsubsection{FloatingMNIST-2}
FloatingMNIST-2 (FMNIST2) is used to evaluate downstream visual reasoning and
few-shot rule transfer. Each sample contains two MNIST digits placed on a
$64 \times 64$ canvas. We use two task variants:

\begin{itemize}
    \item \textbf{FMNIST2-Add}: the target is the sum of the two digits;
    \item \textbf{FMNIST2-Sub}: the target is the absolute difference between
    the two digits.
\end{itemize}
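
Writing the two digit labels as $a, b \in \{0, \dots, 9\}$ (symbols introduced
here for illustration), the two targets are
\[
  y_{\mathrm{add}} = a + b \in \{0, \dots, 18\},
  \qquad
  y_{\mathrm{sub}} = |a - b| \in \{0, \dots, 9\}.
\]
For example, a canvas containing the digits $3$ and $7$ has
$y_{\mathrm{add}} = 10$ and $y_{\mathrm{sub}} = 4$.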

Models are trained on either FMNIST2-Add or FMNIST2-Sub and evaluated on both
in-domain performance and few-shot cross-task transfer with $k{=}100$ labelled
target examples.

% -------------------------------------------------------------------------
\subsection{Evaluation Protocol and Preprocessing}

For video datasets, we sample fixed-length clips and resize frames to the input
resolution used by the model. Pixel intensities are normalized to $[0,1]$.
Unless otherwise specified, ground-truth instance masks and object IDs are used
only for evaluation and not as direct supervision during representation
learning.

For MS COCO, evaluation is performed at the instance level and then aggregated
by semantic category. For MOVi-C and MOVi-E, identity stability is evaluated
within each video using persistent object IDs and visible frame-level masks. For
MOVi-A/B, identity--state factorization is evaluated by measuring the stability
of $\mathbf{z}_{\mathrm{id}}$ and the temporal variation of $s_t$. For OVIS,
predicted objects are matched to annotated masks using mask IoU, and identity
persistence is measured across visible annotated frames.

Additional implementation details and metric definitions are provided in
Appendix~\ref{appsec:metrics}.