Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era
Abstract: There is widespread agreement about the grounded nature of human learning and representation, and the belief that computational models of meaning need to be multimodal. In this paper, we ask to what degree this belief holds in the era of models trained on billions of examples. We investigate the ability of pre-trained vision models to represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended set of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are classified as "encyclopedic" or "function". These results offer new insights into what can be learned from pure unimodal learning, and the complementarity of the modalities.
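To illustrate the kind of probing setup the abstract describes, below is a minimal sketch (not the authors' actual pipeline): a linear probe fit on frozen encoder embeddings to predict a single binary semantic feature, e.g. whether a concept "is a flower". The embedding extraction is stubbed out with random vectors as placeholders; the dimensions, concept count, and scoring choice are illustrative assumptions, and in practice the features would come from a frozen image, multimodal, or language encoder.

```python
# Sketch of a linear probing experiment on frozen embeddings.
# Placeholder data stands in for encoder outputs and feature-norm labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

n_concepts, dim = 500, 512                # hypothetical: 500 concepts, 512-d embeddings
X = rng.normal(size=(n_concepts, dim))    # stand-in for frozen encoder embeddings, one row per concept
y = rng.integers(0, 2, size=n_concepts)   # stand-in for a binary McRae-style feature label

# Linear probe: if the feature is linearly decodable from the embeddings,
# cross-validated performance will exceed chance.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="f1")
print(f"mean cross-validated F1 of the probe: {scores.mean():.3f}")
```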
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, distributional semantics, vision models, language models, semantic norms
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3559