Look, Then Speak: Social Tokens for Grounding LLMs in Visual Interactions

Kathy Garcia; Vighnesh Subramaniam; Boris Katz; Brian Cheung

Published: 23 Sept 2025, Last Modified: 17 Nov 2025

Venue: UniReps 2025

License: CC BY 4.0

Track: Extended Abstract Track

Keywords: Large Language Models, Social Interactions, Social Representational Alignment

TL;DR: We introduce social tokens, a new token type to improve social grounding and visual alignment in LLMs.

Abstract: Social interactions remain a major challenge for large language models (LLMs), which struggle to incorporate visual context and social cues. We propose social tokens, a lightweight mechanism that introduces socially grounded visual information into a frozen LLM. To construct these tokens, we first fine-tune a visual encoder on videos of social interactions to learn embeddings that capture socially relevant cues. A small MLP then projects these embeddings into the LLM’s embedding space, where they are inserted into the input sequence as local and global summaries of the scene. This representational alignment enables the LLM to condition generation on social context without updating its parameters. Empirically, social tokens substantially reduce perplexity on social dialogue and caption datasets, improve alignment with human social judgments, and receive high attention weights during socially salient segments, underscoring both their utility and interpretability.
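The pipeline described in the abstract (visual embeddings, a small MLP projection, insertion into the input sequence of a frozen LLM) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the dimensions, the two-layer ReLU MLP, random weights standing in for trained ones, mean pooling to form the global summary, and prepending the tokens to the text embeddings. All function names are hypothetical.

```python
import random

def make_linear(n_in, n_out, rng):
    """Random weights and zero bias; a placeholder for trained parameters."""
    w = [[rng.gauss(0.0, 0.02) for _ in range(n_out)] for _ in range(n_in)]
    b = [0.0] * n_out
    return w, b

def linear(x, layer):
    """Dense layer: y[j] = sum_i x[i] * w[i][j] + b[j]."""
    w, b = layer
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def relu(x):
    return [max(0.0, v) for v in x]

def project(visual_embedding, mlp):
    """Map one visual embedding into the LLM embedding space (assumed 2-layer MLP)."""
    hidden = relu(linear(visual_embedding, mlp[0]))
    return linear(hidden, mlp[1])

def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed form of the global summary)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_input(text_embeddings, segment_embeddings, mlp):
    """Return [global token] + [local tokens] + text embeddings.

    The text embeddings are passed through untouched, mirroring the idea
    that the LLM itself is frozen and only the inputs change.
    """
    local_tokens = [project(v, mlp) for v in segment_embeddings]
    global_token = project(mean_pool(segment_embeddings), mlp)
    return [global_token] + local_tokens + text_embeddings

rng = random.Random(0)
D_VIS, D_HID, D_LLM = 8, 16, 12  # illustrative sizes, not from the paper
mlp = (make_linear(D_VIS, D_HID, rng), make_linear(D_HID, D_LLM, rng))

# Three video segments from a (hypothetical) social visual encoder,
# and five text token embeddings from the LLM's embedding table.
segments = [[rng.gauss(0.0, 1.0) for _ in range(D_VIS)] for _ in range(3)]
text = [[rng.gauss(0.0, 1.0) for _ in range(D_LLM)] for _ in range(5)]

sequence = build_input(text, segments, mlp)
print(len(sequence), len(sequence[0]))  # 9 12
```

In this sketch only the MLP would hold trainable parameters. The resulting sequence has one global token, three local tokens, and the five original text embeddings, all sharing the LLM's embedding width.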

Submission Number: 98
