Keywords: Large Language Models, cultural competency, spectral analysis, cultural evaluation, knowledge structures, meta-culture
Abstract: Most cultural evaluation frameworks for Large Language Models (LLMs) compare model outputs with ground-truth answers, capturing mainly factual awareness. This overlooks whether models internalize broader cultural structures and pluralism. In this paper, we introduce a spectral-analysis-based framework to uncover large-scale structural patterns in models' cultural knowledge. We test eight LLMs of different sizes across nine cultural domains (e.g., food, religion, language) spanning 170 countries, comparing their learned structures with human data. Results show that instruction-tuned LLMs align more closely with human patterns than older models such as GPT-2 and GPT-J. However, model size is not always an advantage, and performance plateaus: Llama-8B and Gemma-2B perform as well as, or better than, their larger counterparts, Llama-70B and Gemma-9B. These findings differ from model rankings on existing probing-based cultural benchmarks, showing that our method captures a distinct aspect of cultural competency. Furthermore, initial simulation-based experiments demonstrate that, compared to traditional metrics of cultural awareness, the proposed spectral metric better predicts a model's ability to serve a user from an unfamiliar background.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24679