TL;DR: This work seeks to identify the factors that drive the consistency of pairwise representational similarities across different datasets for a wide range of vision foundation models.
Abstract: The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance, irrespective of the objectives and data modalities used to train these models (Huh et al., 2024). Representational similarity is generally measured for individual datasets and is not necessarily consistent across datasets. Thus, one may wonder whether this convergence of model representations is confounded by the datasets commonly used in machine learning. Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations. We find that the objective function is a crucial factor in determining the consistency of representational similarities across datasets. Specifically, self-supervised vision models learn representations whose relative pairwise similarities generalize better from one dataset to another compared to those of image classification or image-text models. Moreover, the correspondence between representational similarities and the models' task behavior is dataset-dependent, being most strongly pronounced for single-domain datasets. Our work provides a framework for analyzing similarities of model representations across datasets and linking those similarities to differences in task behavior.
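To make the analysis described in the abstract concrete, the sketch below shows one way to compute pairwise model similarities on each dataset and then correlate those similarity profiles across datasets. This is a minimal illustrative sketch, not the authors' implementation: the use of linear CKA as the similarity measure, Spearman correlation as the consistency score, and all function names are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's exact method): quantify how consistent
# pairwise model similarities are across evaluation datasets.
import numpy as np
from scipy.stats import spearmanr


def linear_cka(X, Y):
    """Linear CKA between two feature matrices of shape (n_stimuli, dim)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))


def pairwise_model_similarities(features):
    """features: list of (n_stimuli, dim) arrays, one per model.
    Returns the upper-triangular entries of the model-by-model similarity matrix."""
    m = len(features)
    sims = []
    for i in range(m):
        for j in range(i + 1, m):
            sims.append(linear_cka(features[i], features[j]))
    return np.array(sims)


def consistency_across_datasets(features_per_dataset):
    """features_per_dataset: dict mapping dataset name -> list of per-model feature arrays.
    Returns Spearman correlations of the pairwise-similarity vectors for every dataset pair."""
    names = list(features_per_dataset)
    sim_vectors = {d: pairwise_model_similarities(features_per_dataset[d]) for d in names}
    scores = {}
    for a_idx, a in enumerate(names):
        for b in names[a_idx + 1:]:
            rho, _ = spearmanr(sim_vectors[a], sim_vectors[b])
            scores[(a, b)] = rho
    return scores
```

Under this sketch, a high correlation between two datasets would indicate that the relative ordering of pairwise model similarities generalizes from one dataset to the other; the paper's released code (linked below) contains the actual procedure.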
Lay Summary: Many computer vision models (programs that analyze images) convert images into compact numerical descriptions called representations. These representations tend to be organized similarly across different models. We ask whether this similarity reveals a shared understanding of the visual world or reflects the specific image collections (evaluation datasets) used for measuring similarity.
We test how consistent model similarities are when the evaluation dataset changes. Using 64 vision models, we look at how their representations change across various types of images, from everyday objects to specialized domains. We find that similarities are fairly consistent across many datasets for models trained using self-supervised learning (which learn visual patterns without additional information about the image content). However, models trained with paired text descriptions or object categorization tasks exhibit similarities that vary more depending on the evaluation dataset.
To help researchers better understand which models perceive the world similarly, we introduce a structured way of comparing how representational similarity between models changes across datasets. This matters because models whose similarities differ across datasets likely do not share a common understanding of the world. Our results show that the stability of similarities between models depends on the training task.
Link To Code: https://github.com/lciernik/similarity_consistency
Primary Area: Deep Learning->Other Representation Learning
Keywords: representation learning, factors influencing representational similarities, computer vision, vision foundation models, representational similarity across stimuli
Submission Number: 3134