A Framework for Aligning Human Linguistics and AI Perception

Published: 04 Mar 2026, Last Modified: 27 Apr 2026, HCAIR 2026, CC BY 4.0
Keywords: Multimodal Data, Human-AI co-performance, Common Ground, Perception alignment
TL;DR: We introduce a formalization of how to align human linguistics and AI perception, develop its theoretical basis, and demonstrate portions of it with an AI system evaluated on real human speech against human co-performers.
Abstract: Grounding natural language in perceptual representations is central to both human cognition and AI reasoning, yet remains challenging under ambiguity and partial information. We present a computational framework that models key aspects of human referential interpretation by aligning linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The approach approximates human perceptual categorization using scale-invariant feature transform (SIFT) alignment and the Universal Quality Index (UQI), while lightweight linguistic preprocessing captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a benchmark designed to probe perceptual ambiguity and coordination in human communication. The system achieves robust grounding, requiring 65% fewer utterances than human interlocutors to establish stable mappings, and correctly identifying targets from a single utterance 41.66% of the time (compared to 20% for humans). These results suggest that relatively simple perceptual–linguistic alignment mechanisms can exhibit human-competitive behavior on a classic cognitive task, offering insights into grounded reasoning, perceptual inference, and cross-modal concept formation.
Paper Type: New Full Paper
Submission Number: 27
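
The abstract names the Universal Quality Index (UQI) as one of the two image-similarity measures. As a rough illustration only, the sketch below computes the global form of the index, Q = 4·cov(x, y)·mean(x)·mean(y) / ((var(x) + var(y))·(mean(x)² + mean(y)²)), over two flattened grayscale images. It is not the authors' implementation: the function name universal_quality_index is hypothetical, the inputs are assumed to be equal-length sequences of pixel intensities, and the index is often computed over sliding windows and averaged, which this sketch omits.

```python
from statistics import fmean


def universal_quality_index(x, y):
    """Global UQI between two equal-length sequences of pixel intensities.

    Hypothetical helper for illustration; assumes flattened grayscale images.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")

    n = len(x)
    mean_x = fmean(x)
    mean_y = fmean(y)

    # Unbiased sample variances and covariance.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

    denom = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denom == 0:
        # Happens for two constant images or two all-zero-mean images.
        raise ValueError("UQI is undefined when the denominator is zero")

    return 4 * cov_xy * mean_x * mean_y / denom


if __name__ == "__main__":
    image = [1.0, 2.0, 3.0, 4.0]
    # An image compared with itself scores (up to rounding) the maximum of 1.
    print(universal_quality_index(image, image))
    # A reversed copy is anti-correlated, so the score is negative.
    print(universal_quality_index(image, image[::-1]))
```

The index lies in [-1, 1] and reaches 1 only when the two inputs are identical, which is why it can serve as a similarity score between a candidate image and a reference.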