Repertoires, Not Scores: Instability as Signal in Cultural Evaluation of LLMs

Published: 01 Jun 2026, Last Modified: 01 Jun 2026
Venue: Culture x AI 2026 Poster
License: CC BY 4.0
Keywords: Benchmark Design; Cultural Competence; Values; Context-Conditioned Evaluation.
TL;DR: Cultural competence in LLMs is structured variation across contexts, not agreement with a fixed value profile, and current benchmarks cannot distinguish such variation from random prompt sensitivity. We propose two criteria, persistence and repertoire alignment, to fix this.
Abstract: Cultural evaluation of large language models typically scores agreement between model outputs and national-average responses on survey instruments like Hofstede's dimensions or the World Values Survey. This rests on a values paradigm that contemporary sociology has substantially revised: cultural competence is now widely viewed as the contextual deployment of repertoires rather than the expression of stable values. We argue that observed LLM "value instability" across paraphrases and framings is, under this view, what cultural competence should look like — and that current benchmarks cannot distinguish competent contextual adaptation from random prompt sensitivity. We propose two evaluation criteria: persistence, the ratio of preference shift across substantive contexts to shift across incidental perturbations, and repertoire alignment, the divergence between model and human reference distributions on matched item-context cells. We formalize both, work through an example on cross-cultural pragmatics of indirect requests, and audit existing benchmarks against the criteria. Our framework recasts what cultural evaluation should measure and predicts that current cultural alignment methods systematically reduce the contextual variation they should preserve.
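The two criteria named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formalization, so everything concrete here is an assumption made for illustration: "shift" is measured as mean pairwise total variation distance between response distributions, "divergence" as Jensen-Shannon divergence (base 2), and the function names `persistence` and `repertoire_alignment`, the `eps` smoothing term, and the cell keys are hypothetical.

```python
import math
from itertools import combinations


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mean_pairwise_shift(dists):
    """Average total variation distance over all pairs of distributions."""
    pairs = list(combinations(dists, 2))
    return sum(total_variation(p, q) for p, q in pairs) / len(pairs)


def persistence(by_context, by_perturbation, eps=1e-9):
    """Ratio of shift across substantive contexts to shift across
    incidental perturbations (assumed form; eps avoids division by zero)."""
    return mean_pairwise_shift(by_context) / (
        mean_pairwise_shift(by_perturbation) + eps
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical distributions,
    1 for distributions with disjoint support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero mass in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def repertoire_alignment(model_cells, human_cells):
    """Mean divergence between model and human reference distributions
    over the item-context cells present in both (lower is better)."""
    shared = model_cells.keys() & human_cells.keys()
    return sum(
        js_divergence(model_cells[c], human_cells[c]) for c in shared
    ) / len(shared)


# Toy example: distributions over (direct request, indirect request).
contexts = [[0.9, 0.1], [0.1, 0.9]]      # e.g. peer vs. superior addressee
paraphrases = [[0.9, 0.1], [0.8, 0.2]]   # same context, reworded prompt
score = persistence(contexts, paraphrases)

model = {("request", "peer"): [0.9, 0.1], ("request", "boss"): [0.1, 0.9]}
human = {("request", "peer"): [0.9, 0.1], ("request", "boss"): [0.1, 0.9]}
gap = repertoire_alignment(model, human)
```

Under these assumptions, a persistence well above 1 means the model's preferences move more with substantive context than with rewording, and a repertoire alignment near 0 means its per-cell distributions match the human reference. Other distance or divergence measures could be substituted without changing the structure of either criterion.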
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 67