\clearpage
\section{Training and Computational Details}
\label{appsec:computation}

We train all models with the Adam optimizer, a learning rate of $4\times 10^{-4}$, and a batch size of 16. Unless otherwise noted, training uses early stopping with a patience of 5 and runs for at most 40 epochs or 40{,}000 steps, whichever comes first. We apply a linear learning-rate warmup over the first 10{,}000 steps, followed by a Reduce-on-Plateau scheduler with decay factor 0.5. This configuration is used for the main PSC model and for the comparison models, except where the original implementation of a baseline requires a different scheduler for stable reproduction.
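
As a minimal sketch of the schedule described above, assume that the warmup ramps linearly from zero; this starting value and the symbols $\eta_{\max}$, $T_{\mathrm{w}}$, and $\gamma$ are introduced here for illustration only. The learning rate at optimizer step $t$ during warmup can then be written as
\[
\eta_t = \eta_{\max}\,\min\!\left(1, \frac{t}{T_{\mathrm{w}}}\right), \qquad \eta_{\max} = 4\times 10^{-4}, \quad T_{\mathrm{w}} = 10{,}000,
\]
after which each plateau detected by the scheduler updates $\eta \leftarrow \gamma\,\eta$ with $\gamma = 0.5$. Under the further assumption that one step processes one batch of 16 samples, an epoch over $N$ training samples takes $\lceil N/16 \rceil$ steps, so the 40{,}000-step cap is reached before the 40-epoch cap whenever $40\,\lceil N/16 \rceil > 40{,}000$, i.e., for $N > 16{,}000$. In that regime, the warmup spans the first quarter of the maximum step budget.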

For \textsc{\modelcode}, training is performed on fixed-length video clips, since the identity branch aggregates object-level representations across time. During training, input frames are resized to the task-specific resolution described in Appendix~\ref{appsec:datasets}, normalized to $[0,1]$, and converted into object-focused inputs using the mask-generation pipeline described in Sec.~\ref{sec:method}. Unless otherwise noted, we use the same optimization hyperparameters for the object-discovery experiments and the downstream FMNIST2 transfer experiments, with task-specific heads trained on top of the learned object representations.

We run all experiments on a compute cluster equipped with NVIDIA Tesla T4 GPUs (16\,GB) and Intel Xeon Gold 6230 CPUs. Throughput is reported in iterations per second (it/s). Measured values can vary with I/O load, background processes, and implementation details, so they should be read as approximate efficiency indicators rather than hardware-independent benchmarks.
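
For concreteness, a throughput figure of this kind can be read as follows, under the assumption that one iteration corresponds to one batch of $B$ samples; the symbols $n_{\mathrm{it}}$, $\Delta t$, and $B$ are introduced here for illustration only:
\[
\text{it/s} = \frac{n_{\mathrm{it}}}{\Delta t}, \qquad \text{samples/s} = B \cdot \frac{n_{\mathrm{it}}}{\Delta t},
\]
where $n_{\mathrm{it}}$ is the number of iterations completed in wall-clock time $\Delta t$, measured in seconds. If throughput is measured at the training batch size $B = 16$, for example, a rate of $r$ it/s corresponds to $16r$ samples per second.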

To support reproducibility, we will release the experimental configuration files, training scripts, inference scripts, and dataset-generation scripts used in our experiments.

\paragraph{Implementation notes.}
For fair comparison, we use the official implementations of the baselines whenever available and preserve their default architectural components unless changes are required to match input resolution or evaluation protocol. When a baseline is adapted to our setting, we report this explicitly in the corresponding experiment section. For throughput reporting, all models are evaluated under the same batch size and hardware configuration.

