Keywords: multimodality, language models, sentence processing, surprisal, reading times, word meaning, grounded semantics
TL;DR: When "watch" is highly expected in a context, LLMs do not show lower surprisal for "compass" (unexpected in context but visually similar to a watch) than for "dog" (unexpected and visually dissimilar). We are testing whether humans do.
Abstract: The meaning representations that humans construct for words capture both linguistic and multimodal sensorimotor information. To investigate to what extent multimodal pre-activation influences linguistic expectations during sentence processing, we describe a data-driven experimental setup that orthogonally manipulates multimodality (sensorimotor similarity between word vectors) and linguistic predictability (Cloze probability), with materials normed for plausibility, visual similarity, and co-occurrence similarity. We hypothesized that high sensorimotor similarity to the likeliest Cloze completion should reduce processing effort in LLMs and humans, even when a word is not predictable from the linguistic context. We found no such effect in either language-only or vision-language LLM surprisal. We are currently conducting a self-paced reading study to investigate whether, unlike in the LLMs, visual similarity influences human reading times. We will then determine whether humans’ online processing of plausible sentences involves a multimodal dimension that goes beyond Cloze predictability.
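To illustrate the surprisal comparison at the core of the design, below is a minimal sketch assuming GPT-2 via Hugging Face transformers as a stand-in (the abstract does not name the LLMs used, and the example sentence and items are hypothetical). A target word's surprisal is its negative log-probability given the preceding context, summed over its subword tokens.

```python
# Minimal surprisal sketch, assuming GPT-2 as a stand-in for the LLMs
# used in the study (the abstract does not name the models).
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal(context: str, target: str) -> float:
    """Surprisal (in bits) of `target` as a continuation of `context`,
    summed over the target's subword tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(" " + target, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    n_ctx = ctx_ids.shape[1]
    total = sum(
        log_probs[pos, ids[0, pos + 1]].item()
        for pos in range(n_ctx - 1, ids.shape[1] - 1)
    )
    return -total / math.log(2)  # nats -> bits

# Hypothetical item: "watch" is the high-Cloze completion; "compass" is
# visually similar to a watch, "dog" is not. The hypothesis predicts
# surprisal(ctx, "compass") < surprisal(ctx, "dog").
ctx = "To check the time, she glanced at her"
for word in ("watch", "compass", "dog"):
    print(word, round(surprisal(ctx, word), 2))
```

Under the stated hypothesis, surprisal for the visually similar distractor would fall below that for the dissimilar one; the reported finding is that LLM surprisal shows no such difference.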
Submission Number: 10