\section{Discussion, Limitations, Conclusion, and Future Work}
\label{sec:discussion_future}
\paragraph{Discussion.}
Our results suggest that \textsc{\modelcode} keeps object identity more persistent over time than exchangeable-slot models do. By combining grounded object proposals, temporal aggregation, and LFQ identity tokens, the model separates relatively stable object identity from time-varying state factors such as pose, position, visibility, and local appearance. This separation matters because accurate reconstruction or segmentation does not by itself imply stable identity assignment: a model can in principle reconstruct each frame well while reassigning objects to different slots from one frame to the next. The factorization also suggests a broader form of compositional reuse: identity codes are object-specific, whereas state factors such as ``tilted left'', ``partially occluded'', or ``moving upward'' are not, and could in principle be reused across object identities.

\paragraph{Limitations.}
Several limitations remain. First, \textsc{\modelcode} depends on external object proposals and an object-count prior; missed, merged, or fragmented masks can propagate errors into both the identity and the state representations. Second, although our evaluation includes identity-oriented video metrics, the videos considered are limited in length and complexity. Validation on longer, higher-resolution, and unconstrained real-world videos is needed to support claims about long-horizon identity persistence. Third, the current version lacks sufficient qualitative analysis of successes and failures. Failure cases that should be visualized include identity swaps between similar objects, identity drift after occlusion, incorrect assignments caused by mask errors, and possible leakage of identity information into the state representation. Finally, the state representation is not explicitly constrained to be object-agnostic, so state concepts intended to be reusable may remain entangled with object identity.

\paragraph{Future Work.}
Future work should first evaluate \textsc{\modelcode} on longer and more complex real-world video datasets that feature camera motion, clutter, occlusion, object disappearance and re-entry, and visually similar instances. Second, future versions should include qualitative visualizations of masks, memory slots, identity tokens, reconstructions, and identity trajectories to diagnose where the method succeeds and where it fails. Third, the state branch could be extended into a reusable vocabulary of object-agnostic state primitives, allowing factors such as orientation, motion, and visibility to transfer across object categories. Finally, an important direction is to reduce dependence on external masks by developing a self-supervised or proposal-light variant that learns object discovery, temporal correspondence, and identity--state separation directly from raw video.