Smoothing Slot Attention Iterations and Recurrences

17 Sept 2025 (modified: 26 Jan 2026) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: Object-Centric Learning, Slot Attention, Object Discovery, Object Recognition, Dynamics Modeling
TL;DR: We address two issues in slot attention: query cold-start in the iterations on an image or a video's first frame, and transform homogeneity in the recurrences across video frames. Addressing both improves image and video OCL significantly.
Abstract: Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image are aggregated into corresponding slot vectors by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, this aggregation is shared recurrently across frames: queries are cold-started on the first frame and transitioned from the previous frame's slots on non-first frames. However, cold-start queries lack sample-specific cues, which hinders precise aggregation on an image or a video's first frame. Moreover, non-first frames' queries are already sample-specific and thus require aggregation transforms different from those on the first frame. We address these issues for the first time with our SmoothSA: (1) to smooth SA iterations on the image or the video's first frame, we preheat the cold-start queries with rich information from the input features, via a tiny module self-distilled inside OCL; (2) to smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness. Further analyses illuminate how our method smooths SA iterations and recurrences. Our source code and training logs are provided in the supplement.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9379
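
The iteration schedule described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The attention step is a simplified slot attention without learned projections or GRU/MLP updates; `preheat` is a hypothetical placeholder that merely shifts the queries by the mean input feature (the abstract's self-distilled module is not specified here); and the frame-to-frame transition is assumed to be the identity. All function names and the `full_iters` parameter are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sa_step(queries, feats):
    """One simplified slot attention step (no learned projections, no GRU/MLP)."""
    d = len(feats[0])
    scale = 1.0 / math.sqrt(d)
    # attn[n][k]: softmax over slots for each feature, so slots compete for inputs.
    attn = [softmax([scale * dot(q, f) for q in queries]) for f in feats]
    new_slots = []
    for k in range(len(queries)):
        w = [attn[n][k] for n in range(len(feats))]
        total = sum(w) + 1e-8
        # Each slot becomes the attention-weighted mean of the features.
        new_slots.append(
            [sum(w[n] * feats[n][j] for n in range(len(feats))) / total
             for j in range(d)]
        )
    return new_slots

def aggregate(queries, feats, num_iters):
    """Refine the queries into slots with num_iters attention steps."""
    slots = queries
    for _ in range(num_iters):
        slots = sa_step(slots, feats)
    return slots

def preheat(queries, feats):
    """Hypothetical placeholder: inject sample-specific information by adding
    the mean feature to each cold-start query. NOT the paper's module."""
    d = len(feats[0])
    mean = [sum(f[j] for f in feats) / len(feats) for j in range(d)]
    return [[q[j] + mean[j] for j in range(d)] for q in queries]

def run_video(frames, init_queries, full_iters=3):
    """Full iterations on the first frame, a single iteration afterwards."""
    queries = preheat(init_queries, frames[0])
    slots_per_frame, iters_used = [], []
    for t, feats in enumerate(frames):
        n = full_iters if t == 0 else 1
        slots = aggregate(queries, feats, n)
        slots_per_frame.append(slots)
        iters_used.append(n)
        queries = slots  # assumption: identity transition between frames
    return slots_per_frame, iters_used
```

For a single image, the same sketch reduces to `aggregate(preheat(init_queries, feats), feats, 3)`. Real OCL models would replace the identity transition and the placeholder `preheat` with learned modules.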