\section{Related Work}
\label{sec:related-work}

\paragraph{Object-Centric Learning.}
Object-centric learning (OCL) represents scenes as compositions of objects. Early methods such as MONet~\cite{burgess2019monet}, IODINE~\cite{greff2019multi}, GENESIS~\cite{engelcke2020reconstruction}, and related compositional generative models~\cite{greff2017neural,kosiorek2018sequential,lin2020space} learn object-like factors through iterative inference or structured decoding. Slot Attention~\cite{slotattention} made this paradigm scalable and inspired extensions for images and video~\cite{Genesis-v2,slotvae,blockslot,savi++,singh2021illiterate,slate,dinosaur}. While effective on controlled benchmarks, these methods often rely on exchangeable slots, which can make object identity unstable across frames, scenes, and viewpoints~\cite{greff2020binding}. Recent work also suggests that, given strong class-agnostic segmenters, object discovery itself may no longer be the main bottleneck~\cite{rubinstein2025we,hqes}. Our work is complementary to such proposal and tracking pipelines: rather than solving proposal generation, we study how to convert grounded object observations into temporally persistent object representations.

\paragraph{Discrete Representation Learning.}
Discrete latent-variable models map continuous features to discrete codes, as in VQ-VAE~\cite{van2017neural}, Gumbel-Softmax~\cite{jang2016categorical,maddison2016concrete}, and later codebook-based extensions~\cite{esser2021taming,gu2022vector,ramesh2021zero}. In PSC, we adopt the lookup-free tokenizer introduced in MAGVIT-v2~\cite{yu2023language}, which replaces the explicit codebook lookup with a quantizer that binarizes each latent dimension independently. While MAGVIT-v2 uses this tokenizer for image and video generation and compression, PSC repurposes it for object-centric representation learning: temporally aggregated object evidence is mapped to a reusable discrete identity token, which initializes the identity branch and is paired with a continuous state vector for reconstruction and temporal reasoning.
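For concreteness, lookup-free quantization as described in~\cite{yu2023language} can be sketched as follows; the symbols $z$, $d$, $q$, and $\mathrm{Index}$ are illustrative notation introduced here, not part of PSC's formal definition, and the sketch does not cover how PSC produces $z$ from aggregated object evidence. Given a latent $z$ with $d$ real-valued dimensions, each dimension is binarized independently and the resulting sign pattern is read as an integer token:
\[
  q(z_i) = 2\,\mathbf{1}[z_i > 0] - 1,
  \qquad
  \mathrm{Index}(z) = \sum_{i=1}^{d} 2^{\,i-1}\,\mathbf{1}[z_i > 0].
\]
This yields an implicit vocabulary of $2^{d}$ tokens without storing or searching an embedding table. For example, with $d = 3$ and $z = (0.4, -1.2, 0.7)$, the quantized vector is $(+1, -1, +1)$ and the token index is $2^{0} + 2^{2} = 5$.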

\paragraph{Compositional Reasoning and Grounded Representations.}
Compositional reasoning benefits from object representations that are stable and reusable across scenes~\cite{greff2020binding}. Prior OCL methods can decompose scenes into object-like factors~\cite{greff2017neural,kosiorek2018sequential,crawford2019spatially,burgess2019monet,greff2019multi,lin2020space,slotattention,engelcke2020reconstruction,emami2021efficient,kipf2021conditional,dinosaur,savi++}, but exchangeable slots do not explicitly enforce persistent identity. PSC addresses this limitation by factorizing each grounded object into a discrete identity code and a continuous state variable, with explicit temporal aggregation across frames. This makes PSC particularly well suited to settings where the main challenge is not discovering objects from scratch but maintaining a consistent identity for downstream temporal reasoning and compositional transfer~\cite{hu2024visual,acuna2025socratic}.

\paragraph{Segmentation and Tracking Foundation Models.}
Recent foundation models such as SAM~2~\cite{sam2} and SAM~3~\cite{sam3} improve promptable video segmentation and concept-based segmentation, respectively. These models are designed to produce masks or tracks, but they do not by themselves enforce a persistent latent identity representation for an object across time. PSC addresses a different problem: given grounded object observations, it learns a factorized representation with a reusable discrete identity code and a time-varying state, so that the same object can maintain a stable representation across frames despite changes in pose, motion, or appearance.

\paragraph{Superpixel-Guided State Encoding.}
Prior work uses superpixels or region structure for sparse spectral--spatial coding~\cite{fan2017superpixel}, probabilistic color modeling~\cite{lin2018new}, or cross-modal self-supervision~\cite{wang2025confidence}, while related self-supervised and causal multi-modal methods focus on universal or causally grounded representations~\cite{qiang2024universality,wang2024towards}. In contrast, PSC uses superpixels inside the \emph{state encoder}: ViT patch features are organized through a patch-affinity graph and pooled into object-level state latents. This allows \modelcode\ to separate persistent identity from time-varying state, which matters in dynamic object-centric video settings where pose, motion, appearance, and occlusion change over time.