% -------------------------------------------------------------------------
\clearpage
\section{Evaluation Metrics}
\label{appsec:metrics}

This appendix defines the evaluation metrics used in Sec.~\ref{sec:exps}. The
metrics are organized according to the four experimental settings: (i) object
identity stability, (ii) identity--state factorization, (iii)
out-of-distribution grounding and sample-efficient transfer, and (iv) visual
reasoning transfer.

For video datasets, identity-related metrics are computed within each video
using the dataset-provided object annotations. MOVi and OVIS object identifiers
are persistent within a video but are not shared across videos; therefore,
these metrics evaluate within-video physical object identity rather than
cross-video semantic recognition. For MS COCO, which contains static images and
does not provide temporal object tracks, identity stability is evaluated at the
category level.

% -------------------------------------------------------------------------
\subsection{Object--Slot Matching}
\label{appsec:object_slot_matching}

For each visible object $O_i$ at frame $t$, we assign the object to the
predicted slot with maximum mask overlap:
\[
s_t(i)=\arg\max_k \operatorname{IoU}(M_i^t,A_k^t),
\]
where $M_i^t$ is the ground-truth object mask and $A_k^t$ is the predicted slot
mask or attention map. The set of visible frames for object $O_i$ is denoted by
$\mathcal{T}_i$.
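
As an illustration with hypothetical values, suppose three slots are predicted
at frame $t$ and the ground-truth mask of object $O_i$ overlaps them as follows:
\[
\operatorname{IoU}(M_i^t,A_1^t)=0.12,\quad
\operatorname{IoU}(M_i^t,A_2^t)=0.71,\quad
\operatorname{IoU}(M_i^t,A_3^t)=0.05.
\]
The matching rule then gives $s_t(i)=2$. Note that this per-object rule does
not by itself prevent two objects from being assigned to the same slot.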

For MS COCO, the same matching rule is applied to static images by matching
each ground-truth instance mask to the predicted slot with the highest mask IoU.
Since COCO does not contain temporal tracks, the resulting slot embeddings are
used only for category-level representation stability.

For OVIS, predicted objects are matched to ground-truth instance masks using
Hungarian matching with mask IoU. A match is accepted only when the IoU exceeds
the threshold $\tau_{\mathrm{IoU}}$.

% -------------------------------------------------------------------------
\subsection{Slot Assignment Consistency}
\label{appsec:sac}

Slot Assignment Consistency (SAC) measures whether the same physical object is
assigned to the same slot throughout its visible trajectory. For each object
$O_i$, we first compute its dominant slot:
\[
s_i^\star =
\operatorname{mode}(\{s_t(i)\}_{t\in\mathcal{T}_i}).
\]
SAC is then defined as:
\[
\mathrm{SAC}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\mathbb{1}[s_t(i)=s_i^\star].
\]
Higher SAC indicates stronger object--slot persistence. SAC is reported for the
MOVi temporal identity stability experiment.
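
For example, with illustrative values, an object visible in five frames with
slot assignments $(2,2,3,2,2)$ has dominant slot $s_i^\star=2$ and
\[
\mathrm{SAC}_i=\frac{4}{5}=0.8,
\]
since the assignment agrees with the dominant slot in four of the five visible
frames.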

% -------------------------------------------------------------------------
\subsection{Representation Deviation}
\label{appsec:rd}

Representation Deviation (RD) measures how much the representation associated
with the same object changes across observations. Let $z_i^t$ be the
representation of object $O_i$ at frame $t$, obtained from its matched slot.
The mean representation is:
\[
\bar{z}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i} z_i^t.
\]
RD is then:
\[
\mathrm{RD}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\|z_i^t-\bar{z}_i\|_2.
\]
Lower RD indicates a more stable object representation.
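
As a small worked example with hypothetical two-dimensional representations,
let an object be visible in two frames with $z_i^1=(0,0)$ and $z_i^2=(6,8)$.
Then
\[
\bar{z}_i=(3,4),\qquad
\|z_i^1-\bar{z}_i\|_2=\|z_i^2-\bar{z}_i\|_2=\sqrt{3^2+4^2}=5,\qquad
\mathrm{RD}_i=\tfrac{1}{2}(5+5)=5.
\]
Because RD is an unnormalized Euclidean distance, its value scales with the
norm of the representations.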

For MOVi-C and MOVi-E, RD measures temporal representation stability for the
same physical object. For MS COCO, RD is computed over matched object
representations grouped by semantic category, since temporal identity tracks
are unavailable.

% -------------------------------------------------------------------------
\subsection{Identification Rate}
\label{appsec:idr}

Identification Rate at rank 1 (IDR@1) evaluates whether an object
representation retrieves the correct identity under nearest-neighbour matching.

For temporal video datasets, given a query representation $z_i^t$, the retrieved
object at a later visible frame $t+\Delta$ is:
\[
\hat{i} =
\arg\max_j \cos(z_i^t,z_j^{t+\Delta}).
\]
The retrieval is correct if $\hat{i}=i$, and IDR@1 is the fraction of correct
retrievals over all queries:
\[
\mathrm{IDR@1} =
\frac{1}{|\mathcal{Q}|}
\sum_{(i,t,\Delta)\in\mathcal{Q}}
\mathbb{1}
\left[
\arg\max_j \cos(z_i^t,z_j^{t+\Delta})=i
\right].
\]
Here, $\mathcal{Q}$ is the set of valid queries $(i,t,\Delta)$ for which object
$O_i$ is visible at both frames $t$ and $t+\Delta$. Higher IDR@1 indicates
stronger identity preservation.
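
To illustrate with hypothetical similarities, consider three objects and one
query per object, with the following cosine similarities between each query
$z_i^t$ and the candidates $z_j^{t+\Delta}$:
\[
\begin{array}{c|ccc}
 & j=1 & j=2 & j=3\\ \hline
i=1 & 0.91 & 0.40 & 0.15\\
i=2 & 0.62 & 0.58 & 0.10\\
i=3 & 0.20 & 0.33 & 0.87
\end{array}
\]
The queries for objects 1 and 3 retrieve the correct identity, whereas the
query for object 2 retrieves object 1, so $\mathrm{IDR@1}=2/3\approx 0.67$.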

For MS COCO, IDR@1 is computed at the category level: a retrieved
representation is counted as correct when it belongs to the same semantic
category as the query representation.

% -------------------------------------------------------------------------
\subsection{Identity--State Factorization Metrics}
\label{appsec:factorization_metrics}

Identity Deviation (IDev) measures how much the identity token changes for the
same physical object over time:
\[
\mathrm{IDev}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\left\|
z_{\mathrm{id}}^t(i)-\bar{z}_{\mathrm{id}}(i)
\right\|_2,
\]
where
\[
\bar{z}_{\mathrm{id}}(i)=
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
z_{\mathrm{id}}^t(i).
\]
Lower IDev means that the identity token remains more stable across the video.

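State Variation (SVar) measures how much the state representation changes over
time. Here $s_t(i)$ denotes the state representation of object $O_i$ at frame
$t$, which is distinct from the slot index of
Sec.~\ref{appsec:object_slot_matching}. With temporal mean $\bar{s}(i)$, SVar
is defined as: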
\[
\bar{s}(i)=
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
s_t(i),
\]
\[
\mathrm{SVar}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\left\|
s_t(i)-\bar{s}(i)
\right\|_2.
\]
Higher SVar indicates that the state branch captures temporal changes such as
position, motion, orientation, visibility, and appearance variation.
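
As an idealized illustration with hypothetical values, consider an object
visible in two frames whose identity token is constant,
$z_{\mathrm{id}}^1(i)=z_{\mathrm{id}}^2(i)$, while its state representation
moves from $s_1(i)=(0,0)$ to $s_2(i)=(6,8)$. Then
\[
\mathrm{IDev}_i=0,\qquad
\bar{s}(i)=(3,4),\qquad
\mathrm{SVar}_i=\tfrac{1}{2}(5+5)=5.
\]
A well-factorized representation is expected to combine low IDev with high
SVar in this way.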

The identity--state factorization experiment reports IDev and SVar on
MOVi-A/B.

% -------------------------------------------------------------------------
\subsection{Out-of-distribution Grounding and Transfer Metrics}
\label{appsec:ood_metrics}

OOD FG-ARI measures foreground object grouping quality on OVIS. It evaluates
how well predicted object groupings align with annotated foreground object
instances, ignoring background pixels. Higher OOD FG-ARI indicates better
object grouping under distribution shift.
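
For reference, the following is a sketch of the standard Adjusted Rand Index,
under the assumption that FG-ARI follows the usual convention of restricting
ARI to foreground pixels. Let $n_{ij}$ be the number of foreground pixels that
belong to ground-truth instance $i$ and predicted group $j$, with row sums
$a_i=\sum_j n_{ij}$, column sums $b_j=\sum_i n_{ij}$, and
$n=\sum_{i,j} n_{ij}$ (this notation is introduced here for illustration only).
Then
\[
\mathrm{ARI}=
\frac{\sum_{i,j}\binom{n_{ij}}{2}
-\left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\Big/\binom{n}{2}}
{\frac{1}{2}\left[\sum_i\binom{a_i}{2}+\sum_j\binom{b_j}{2}\right]
-\left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\Big/\binom{n}{2}}.
\]
A value of 1 corresponds to a perfect match up to a relabelling of groups, and
values near 0 correspond to chance-level agreement.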

OOD mIoU measures mask grounding quality on OVIS. It is computed between
matched predicted masks and ground-truth instance masks. Higher OOD mIoU
indicates better spatial grounding.
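
A minimal sketch of the computation, assuming binary masks and assuming that
the average is taken over the set $\mathcal{P}$ of accepted matches (the symbol
$\mathcal{P}$ and this averaging convention are illustrative assumptions, not
stated above):
\[
\operatorname{IoU}(M,A)=\frac{|M\cap A|}{|M\cup A|},
\qquad
\mathrm{mIoU}=\frac{1}{|\mathcal{P}|}
\sum_{(i,k)\in\mathcal{P}}\operatorname{IoU}(M_i,A_k).
\]
For instance, a predicted mask of 120 pixels and a ground-truth mask of 100
pixels that share 80 pixels give
$\operatorname{IoU}=80/(120+100-80)=80/140\approx 0.57$.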

OOD IDR@1 applies the identification-rate metric of Sec.~\ref{appsec:idr} to
OVIS. An identity token from frame $t$ must retrieve the same object at a later
visible annotated frame $t+\Delta$. Higher OOD IDR@1 indicates stronger
identity persistence under real-world shifts in appearance, clutter, camera
motion, and occlusion.

OOD IDev applies the identity-deviation metric to OVIS:
\[
\mathrm{IDev}_i =
\frac{1}{|\mathcal{T}_i|}
\sum_{t\in\mathcal{T}_i}
\left\|
z_{\mathrm{id}}^t(i)-\bar{z}_{\mathrm{id}}(i)
\right\|_2.
\]
Lower OOD IDev means that the same object keeps a more stable identity
representation across visible annotated frames.

The source-to-target generalization gap is reported as the FG-ARI drop from the
source setting to OVIS:
\[
\mathrm{Gap}
=
\mathrm{FG\text{-}ARI}_{\mathrm{source}}
-
\mathrm{FG\text{-}ARI}_{\mathrm{OVIS}}.
\]
A smaller gap indicates better out-of-distribution transfer.
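For example, with hypothetical scores
$\mathrm{FG\text{-}ARI}_{\mathrm{source}}=0.80$ and
$\mathrm{FG\text{-}ARI}_{\mathrm{OVIS}}=0.55$, the gap is $0.80-0.55=0.25$. A
negative gap would mean that the model scores higher on OVIS than on the source
setting.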

For sample-efficient transfer, the learned representation is frozen and a
lightweight target head is trained using $k\in\{10,50,100\}$ labelled OVIS
tracks. The 10-shot score reports target accuracy with $k=10$, and Few-shot AUC
summarizes target accuracy across the tested values of $k$.
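
The exact form of the Few-shot AUC is not specified above. One common choice,
given here only as an assumed sketch, is the trapezoidal area under the
accuracy curve normalized by the range of $k$, where $a_k$ (a symbol introduced
for illustration) denotes target accuracy with $k$ labelled tracks:
\[
\mathrm{AUC}=
\frac{1}{100-10}
\left[
(50-10)\,\frac{a_{10}+a_{50}}{2}
+(100-50)\,\frac{a_{50}+a_{100}}{2}
\right].
\]
With hypothetical accuracies $a_{10}=0.50$, $a_{50}=0.70$, and $a_{100}=0.80$,
this gives $(40\cdot 0.60+50\cdot 0.75)/90=61.5/90\approx 0.68$.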

% -------------------------------------------------------------------------
\subsection{Visual Reasoning Transfer Metrics}
\label{appsec:reasoning_metrics}

For FMNIST2 addition and subtraction, accuracy measures whether the model
predicts the correct arithmetic output.

HMC measures rationale misalignment between the model-selected evidence and the
ground-truth relevant objects. Lower HMC indicates better rationale alignment.

Rationale F1 measures the overlap between the predicted relevant objects and
the ground-truth relevant objects. Higher F1 indicates more faithful
object-level reasoning.
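
Assuming the usual set-based definition, let $\hat{R}$ be the set of predicted
relevant objects and $R$ the ground-truth set (both symbols are introduced here
for illustration). Then
\[
P=\frac{|\hat{R}\cap R|}{|\hat{R}|},\qquad
\mathrm{Rec}=\frac{|\hat{R}\cap R|}{|R|},\qquad
\mathrm{F1}=\frac{2\,P\cdot\mathrm{Rec}}{P+\mathrm{Rec}}.
\]
For instance, if three objects are predicted as relevant, three are truly
relevant, and two are shared, then $P=\mathrm{Rec}=2/3$ and
$\mathrm{F1}=2/3\approx 0.67$.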

In the FMNIST2 transfer experiments, source accuracy and HMC are reported for
the in-domain task, while target accuracy and rationale F1 are reported for
few-shot Add$\rightarrow$Sub and Sub$\rightarrow$Add transfer.