I Am Aligned, But With Whom? Diagnosing Structural Alignment Failures in Multilingual LLMs
Keywords: Cultural alignment, multilingual LLMs, MENA, cross-lingual consistency, logit leakage, AI safety, value alignment
TL;DR: We diagnose three systematic failures in LLM cultural alignment across the MENA region: reasoning-induced degradation, logit leakage, and linguistic determinism.
Abstract: Current alignment strategies increasingly rely on reasoning-based evaluations and safety fine-tuning to improve robustness and mitigate bias. We challenge the efficacy of these paradigms in cross-cultural contexts through a large-scale diagnostic study of Large Language Models. Using over 820,000 data points derived from authoritative surveys across the Middle East and North Africa (MENA), we probe the internal representations and reasoning dynamics of seven diverse models. Our analysis uncovers three systematic failures. First, we identify reasoning-induced degradation: prompting models to explain their reasoning is associated with decreased cultural alignment scores. Second, we reveal logit leakage: models exhibit performative safety by refusing sensitive questions in generated text while simultaneously assigning high probability mass ($>75\%$) to biased answers in their internal distributions. Third, we demonstrate linguistic determinism: internal representations collapse diverse nations into simplistic clusters based solely on language family, overriding actual cultural heterogeneity. These findings suggest that current multilingual alignment is superficial, relying on linguistic proxies rather than genuine cultural understanding. We release the MENAValues diagnostic suite to facilitate further research into the interpretability and faithfulness of cross-cultural alignment.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 107
Loading