Keywords: Cultural alignment, Multilingual LLMs, Cross-cultural NLP
Abstract: Current alignment strategies increasingly rely on reasoning-based evaluations and safety fine-tuning to improve robustness and mitigate bias. We challenge the efficacy of these paradigms in cross-cultural contexts through a large-scale diagnostic study of Large Language Models. Using over 820,000 data points derived from authoritative surveys across the Middle East and North Africa (MENA), we probe the internal representations and reasoning dynamics of seven diverse models. Our analysis uncovers three systematic failures. First, we identify reasoning-induced degradation: prompting models to explain their reasoning is associated with decreased cultural alignment scores. Second, we reveal logit leakage: models exhibit performative safety by refusing sensitive questions in generated text while simultaneously assigning high probability mass (>75%) to biased answers in their internal distributions. Third, we demonstrate linguistic determinism: internal representations collapse diverse nations into simplistic clusters based solely on language family, overriding actual cultural heterogeneity. These findings suggest that current multilingual alignment is superficial, relying on linguistic proxies rather than genuine cultural understanding. We release the MENAValues diagnostic suite to facilitate further research into the interpretability and faithfulness of cross-cultural alignment.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation, language/cultural bias analysis, ethical considerations in NLP applications, transparency
Contribution Types: Model analysis & interpretability
Languages Studied: English, Persian, Turkish, Arabic
Submission Number: 7255