Cultural Fidelity in Large-Language Models: An Evaluation of Online Language Resources as a Driver of Model Performance in Value Representation
Abstract: Training data for LLMs increasingly embed the societal values associated with the data's language and cultural origin. Our analysis shows that 44% of GPT-4o's ability to reflect a country's societal values (as measured by the World Values Survey) correlates with the availability of digital resources in that society's primary language. Error rates in the lowest-resource languages were more than five times higher than in the highest-resource ones. Using a dataset of 21 country-language pairs, each containing 94 survey questions verified by native speakers, we demonstrate the link between LLM performance and online data availability. A weaker link and differentiated results for GPT-4-turbo suggest efforts to improve familiarity with non-English languages beyond web-scraped data. This performance disparity in value representation, which particularly affects lower-resource languages in the Global South, risks deepening digital divides.
Paper Type: Short
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: language/cultural bias analysis, sociolinguistics, less-resourced languages
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English, Spanish, German, Japanese, Russian, Portuguese, Turkish, Farsi, Mandarin, Indonesian, Vietnamese, Korean, Greek, Serbian, Hindi, Burmese, Swahili, Filipino, Tajik, Amharic, Hausa, Shona
Submission Number: 1217