Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era

Published: 07 May 2025, Last Modified: 29 May 2025 · VisCon 2025 Poster · CC BY 4.0
Keywords: multimodal, linear probing, interpretability, concepts, attributes
TL;DR: Interpretability study comparing the concept representations from vision-only, language-only, and multimodal encoders, in terms of their ability to predict semantic attributes; results show very comparable performance across modalities.
Abstract: Accurate understanding of a concept includes representing the common attributes and affordances of that concept across multiple modalities. We investigate the ability of pre-trained vision models to represent the semantic attributes of concrete object concepts, e.g. a ROSE "is red", "smells sweet", and "is a flower". More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended, denser version of the classic McRae semantic attribute norms, which are widely used in NLP, and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to language models, even on non-visual attributes classified as "encyclopedic" or "function". These results offer new insights into what can be learned from pure unimodal learning, and into the complementarity of the modalities. (Results, code, and data are available on the project webpage: https://danoneata.github.io/seeing-what-tastes-good/)
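To make the probing setup concrete, here is a minimal sketch of how a linear probe can be fit on frozen encoder embeddings to predict a single binary semantic attribute (e.g. "is red"). This is an illustrative example under assumed inputs, not the authors' exact pipeline; the file names, embedding source, and the choice of logistic regression with cross-validated F1 are assumptions for the sketch.

```python
# Minimal linear-probing sketch (assumptions: embeddings and attribute labels
# are precomputed and saved to the hypothetical files below).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One embedding per concept from a frozen encoder (e.g. an image tower or a
# language model), plus 0/1 labels for a single attribute such as "is_red".
embeddings = np.load("concept_embeddings.npy")   # shape: (num_concepts, dim)
labels = np.load("attribute_is_red.npy")         # shape: (num_concepts,)

# Fit a logistic-regression probe and report cross-validated F1, so that
# probe performance reflects how linearly decodable the attribute is from
# the frozen representation.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, embeddings, labels, cv=5, scoring="f1")
print(f"Mean F1 across folds: {scores.mean():.3f}")
```

Repeating this per attribute and per encoder yields the kind of cross-modality comparison summarized in the abstract.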
Submission Number: 34