Abstract: Human perception and cognition rely on the ability to decompose complex scenes into object-centric representations. Recent advances in unsupervised object segmentation have enabled the discovery of structured, interpretable representations that improve performance across a wide range of computer vision tasks. While object-centric learning has shown promising results, existing methods often struggle to generalize across diverse domains and are typically limited to either synthetic or controlled real-world settings. Slot Attention has become a dominant architecture for object-centric representation learning. An alternative approach based on Gaussian mixture modeling, the Slot Mixture Module (SMM), offers conceptual advantages by replacing cross-attention with probabilistic mixture components; however, SMM exhibits unstable convergence and limited performance on real-world data. In this work, we propose a novel training mechanism for slot-based models that significantly improves stability and adaptability. Our method introduces a hierarchical key aggregation strategy in the encoder, combined with an iterative refinement scheme within each slot update step. The pseudo-attention weights and mixture component parameters are modeled probabilistically and refined through a stabilized optimization procedure. Code is available at https://github.com/VovaFrolow/rapid.
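The abstract does not spell out the proposed update rule, but the iterative slot refinement it builds on (competitive attention over slots followed by a weighted-mean update, as in generic Slot Attention / soft-EM formulations) can be illustrated with a minimal NumPy sketch. All function names here are hypothetical and the details (dot-product similarity, mean-only update, no learned projections or GRU) are deliberate simplifications, not the paper's actual method:

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_update_step(slots, inputs, eps=1e-8):
    """One refinement step. slots: (K, D), inputs: (N, D).

    Attention is normalized over the slot axis, so slots compete
    for each input feature; slots are then recomputed as
    attention-weighted means of the inputs (an M-step-like update).
    """
    logits = inputs @ slots.T / np.sqrt(slots.shape[1])       # (N, K)
    attn = softmax(logits, axis=1)                            # compete over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)  # per-slot weighted mean
    return weights.T @ inputs                                 # (K, D)

def iterate_slots(inputs, num_slots=4, num_iters=3, seed=0):
    """Run several refinement steps from randomly initialized slots."""
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(num_slots, inputs.shape[1]))
    for _ in range(num_iters):
        slots = slot_update_step(slots, inputs)
    return slots
```

In mixture-based variants such as SMM, the softmax responsibilities above would be replaced by Gaussian mixture posteriors, and each slot would carry a mean and covariance rather than a single vector.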
External IDs: dblp:conf/hais/FrolovVUP25