Words That Make Language Models Perceive

Sophie L. Wang; Phillip Isola; Brian Cheung

16 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0

Keywords: platonic representation hypothesis, perception, large language models

TL;DR: A simple cue, such as asking the model to 'see' or 'hear', can push a purely text-trained language model toward the representations of purely image-trained and purely audio-trained encoders.

Abstract: Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text‑only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to 'see' or 'hear', it cues the model to resolve its next‑token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality‑appropriate representations in purely text‑trained LLMs.
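As a rough illustration of what "representational alignment" between two models can mean, the sketch below computes a mutual k-nearest-neighbor overlap between two sets of embeddings of the same items. This is an assumption made for illustration: the abstract does not name the alignment metric, and the function names `knn_indices` and `mutual_knn_alignment`, the cosine similarity, and the choice of `k` are all hypothetical. In practice, `feats_a` would hold LLM embeddings of captions (with or without a sensory cue) and `feats_b` would hold vision or audio encoder embeddings of the paired images or sounds.

```python
import numpy as np


def knn_indices(feats, k):
    """Return the indices of the k nearest neighbors (cosine similarity) of each row."""
    # Normalize rows so that the dot product equals cosine similarity.
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Exclude each item from its own neighbor list.
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a, feats_b, k=5):
    """Average fraction of shared k-nearest neighbors across two embedding spaces.

    feats_a and feats_b must describe the same items in the same row order.
    Returns a value in [0, 1]; higher means more similar neighborhood structure.
    """
    if feats_a.shape[0] != feats_b.shape[0]:
        raise ValueError("both embedding sets must cover the same items")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder random features standing in for real model embeddings.
    text_feats = rng.normal(size=(50, 16))
    image_feats = rng.normal(size=(50, 32))
    print(mutual_knn_alignment(text_feats, image_feats, k=5))
```

Under this sketch, the hypothesis in the abstract would correspond to the score rising when the text embeddings are extracted under a 'see' or 'hear' cue rather than a neutral prompt.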

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 7909
