Keywords: visual simulation, multimodal large language models, grounded cognition, vision-language models, embodied simulation
TL;DR: This paper defines visual simulation in multimodal language models, identifies the architectural and training components needed to enable it, and proposes methods to detect it.
Abstract: Inspired by extensive evidence that humans use *visual simulation* to understand parts of language, this position paper explores analogous behavior in multimodal language models. Although grounding models in vision has been an active area of research in the AI community, visual simulation remains underexplored. We address this gap by formally defining visual simulation, identifying the architectural and training components needed to enable it, and proposing a multi-pronged approach to determine whether a model engages in visual simulation.
Submission Number: 42