How PARTs assemble into wholes: Learning the relative composition of images

Published: 05 Nov 2025, Last Modified: 05 Nov 2025
Venue: NLDL 2026 Spotlight
License: CC BY 4.0
Keywords: Self-supervised, Masked Image Modeling, Transformer, Off-grid patch sampling, Relative Translations, Object detection, EEG signals
Abstract: The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images: an off-grid structural relative positioning that generalizes beyond occlusions and deformations. In tasks requiring precise spatial understanding such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens up a new trajectory for universal self-supervised pretraining across diverse data types, from images to EEG signals, with potential in medical imaging, video, and audio.
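To make the pretext task concrete, the following is a minimal sketch of the idea the abstract describes: sample patches at continuous (off-grid) positions and regress the relative translations between patch pairs. It assumes a PyTorch-style pipeline; all function and variable names (`sample_offgrid_patches`, `relative_translation_loss`, `pred_pairwise`) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an off-grid relative-translation pretext task, per the
# abstract. Names and details are assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def sample_offgrid_patches(image, num_patches=16, patch_size=16):
    """Crop patches at continuous positions instead of a fixed grid."""
    _, H, W = image.shape
    # Continuous top-left corners, uniform over the valid range (assumption).
    ys = torch.rand(num_patches) * (H - patch_size)
    xs = torch.rand(num_patches) * (W - patch_size)
    patches = torch.stack([
        image[:, int(y):int(y) + patch_size, int(x):int(x) + patch_size]
        for y, x in zip(ys, xs)
    ])
    # Patch centers in pixel coordinates, used to build relative targets.
    centers = torch.stack([ys, xs], dim=1) + patch_size / 2
    return patches, centers


def relative_translation_loss(pred_pairwise, centers, image_size):
    """Regress the pairwise translation (dy, dx) between every patch pair,
    normalized by image size so targets lie in a continuous range."""
    target = (centers[:, None, :] - centers[None, :, :]) / image_size
    return F.mse_loss(pred_pairwise, target)
```

In a full pipeline, a Transformer encoder would embed the sampled patches and a pairwise prediction head would produce `pred_pairwise`. Because the targets are continuous offsets rather than grid indices, the task is a regression, which is what distinguishes this setup from grid-based position-prediction methods such as DropPos.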
Serve As Reviewer: ~Melika_Ayoughi1
Submission Number: 19