Conditional Random Fields for Structured Representation Learning from Pretrained Features

Published: 30 May 2026, Last Modified: 01 Jun 2026 | SPIGM @ ICML Poster | CC BY 4.0
Keywords: conditional random fields, probabilistic inference, object-centric, representation learning
Abstract: Pretrained vision transformers encode rich semantic and perceptual structure, but this structure is not inherently organized into object-centric representations. Slot Attention (SA) is a widely used approach for extracting such representations: it performs trainable clustering over frozen pretrained features and reconstructs the features from the cluster centroids, referred to as slots. In this work, we investigate whether classical probabilistic structured prediction methods can improve structured latent representation learning. Specifically, we incorporate Conditional Random Fields (CRFs) that use SA assignment scores as unary potentials and pairwise terms derived from spatial proximity and similarity in frozen DINO features. We apply CRFs in two places: to the attention scores produced by SA, which they refine, and to the decoder cross-attention layers, where the compatibility function is learned from slot representations. Experiments on real-world image datasets demonstrate substantial improvements over baselines, achieving state-of-the-art performance in unsupervised object discovery.
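The CRF refinement described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the inference scheme or the exact form of the pairwise term, so everything below is an assumption made for illustration: mean-field inference, a Gaussian kernel over patch position and feature distance, a Potts-style compatibility, and toy two-dimensional features standing in for frozen DINO features. The function names (`pairwise_affinity`, `mean_field_refine`) and all parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pairwise_affinity(features, coords, sigma_pos=2.0, sigma_feat=0.5):
    # Assumed kernel: Gaussian in squared spatial distance and squared feature distance.
    # features: (N, D) patch features, coords: (N, 2) patch grid positions.
    pos_d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    feat_d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    w = np.exp(-pos_d2 / (2 * sigma_pos**2) - feat_d2 / (2 * sigma_feat**2))
    np.fill_diagonal(w, 0.0)  # a patch sends no message to itself
    return w


def mean_field_refine(logits, affinity, n_iters=5, pairwise_weight=1.0):
    # logits: (N, K) slot assignment scores, used here as (negative) unary energies.
    # Assumed Potts-style compatibility: neighbours are rewarded for sharing a slot.
    q = softmax(logits, axis=1)
    for _ in range(n_iters):
        message = affinity @ q  # (N, K): affinity-weighted neighbour beliefs
        q = softmax(logits + pairwise_weight * message, axis=1)
    return q


# Toy 4x4 patch grid: the left half and the right half have different features.
grid = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="ij"), -1)
coords = grid.reshape(-1, 2).astype(float)
left = coords[:, 1] < 2
features = np.zeros((16, 2))
features[left, 0] = 1.0
features[~left, 1] = 1.0

# Unary scores favour slot 0 on the left and slot 1 on the right,
# except patch 0, which is (wrongly) scored in favour of slot 1.
logits = np.where(left[:, None], [1.0, 0.0], [0.0, 1.0])
logits[0] = [0.0, 0.5]

q_refined = mean_field_refine(logits, pairwise_affinity(features, coords))
```

In this toy setup, patch 0 starts with a unary score that prefers the wrong slot, and the messages from nearby patches with similar features pull it back to the slot of its neighbours. In the setting the abstract describes, `logits` would come from Slot Attention and `features` from a frozen DINO backbone; the kernel widths and the number of iterations here are arbitrary choices.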
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 173