Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era
Abstract: There is widespread agreement about the grounded nature of human learning and representation, and the belief that computational models of meaning need to be multimodal. In this paper, we ask to what degree this belief holds in the era of models trained on billions of examples. We investigate the ability of pre-trained vision models to represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended set of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are classified as "encyclopedic" or "function". These results offer new insights into what can be learned from pure unimodal learning, and the complementarity of the modalities.
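To illustrate the kind of probing setup the abstract describes, below is a minimal sketch (not the authors' actual pipeline): a linear probe fit on frozen encoder embeddings to predict a single binary semantic feature, e.g. whether a concept "is a flower". The embedding extraction is stubbed out with random vectors as placeholders; the dimensions, concept count, and scoring choice are illustrative assumptions, and in practice the features would come from a frozen image, multimodal, or language encoder.

```python
# Sketch of a linear probing experiment on frozen embeddings.
# Placeholder data stands in for encoder outputs and feature-norm labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

n_concepts, dim = 500, 512                # hypothetical: 500 concepts, 512-d embeddings
X = rng.normal(size=(n_concepts, dim))    # stand-in for frozen encoder embeddings, one row per concept
y = rng.integers(0, 2, size=n_concepts)   # stand-in for a binary McRae-style feature label

# Linear probe: if the feature is linearly decodable from the embeddings,
# cross-validated performance will exceed chance.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="f1")
print(f"mean cross-validated F1 of the probe: {scores.mean():.3f}")
```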
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, distributional semantics, vision models, language models, semantic norms
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3559