Keywords: LLM, Evaluation, Culture
Abstract: Large language models (LLMs) evolve not only in scale and benchmark performance but also in how they mediate human communication. We evaluate GPT-4, Claude, DeepSeek, and Qwen on culturally sensitive scenarios involving identity, language, and facework, treating cultural adaptation as an emergent ability across the LLM lifecycle. Using controlled prompts and interpreting results through Hofstede's cultural dimensions and the GLOBE framework, we find systematic divergences: Western models emphasize individualism and directness, while Chinese models adopt collectivist, high-context strategies. Moreover, GPT-4 shifts style when prompted in Chinese, revealing that cultural alignment is dynamic rather than fixed. These findings extend LLM evaluation beyond accuracy to the lifecycle of cross-cultural behavior, underscoring the need for culturally aware scaling and inclusive benchmarks.
Submission Number: 146