CultureManip: A Novel Benchmark for Mental Manipulation Detection Across Multilingual Settings

Published: 14 Dec 2025, Last Modified: 14 Dec 2025 · LM4UC@AAAI2026 · CC BY 4.0
Keywords: mental manipulation detection, multilingual NLP, low-resource languages, large language models, cross-cultural AI
TL;DR: We present CultureManip, a benchmark dataset showing that LLMs' manipulation detection capabilities degrade significantly from high-resource to low-resource languages, exposing critical gaps in cross-cultural AI safety.
Abstract: Detecting mental manipulation is a culturally dependent and highly subjective task. We introduce CultureManip, a multilingual benchmark for manipulation detection across English, Spanish, Chinese, and Tagalog. Using raw inter-annotator agreement as our evaluation metric, we compare human–human consistency with human–LLM agreement to assess how well GPT-3.5 Turbo aligns with native speakers. Human–LLM agreement is 48% in English, 41% in Spanish, 28% in Chinese, and 20% in Tagalog, revealing sharp performance drops outside English. These results demonstrate a clear correlation between language resource availability and detection accuracy. Notably, Spanish exhibits the largest decline relative to its high human–human agreement, indicating a mismatch between model assumptions and Spanish pragmatic norms. Chinese and Tagalog show both lower human–human consistency and additional model degradation, reflecting challenges tied to indirectness, politeness strategies, and translation artifacts. These findings highlight significant cultural and linguistic gaps in current LLMs and underscore the need for culturally aware, multilingual approaches to manipulation detection.
Submission Number: 22
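For concreteness, the sketch below shows one way to compute the raw (percent) agreement metric the abstract describes, applied to both human–human and human–LLM label comparisons. This is a minimal illustration, not the authors' code: the function name, label scheme ("manip"/"none"), and example annotations are all hypothetical.

```python
from typing import Sequence

def raw_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of items on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Label sequences must be the same length.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical labels for five dialogue snippets
# ("manip" = manipulative, "none" = not manipulative).
human_1 = ["manip", "none", "manip", "manip", "none"]
human_2 = ["manip", "none", "none", "manip", "none"]
llm     = ["none", "none", "manip", "none", "manip"]

# Human-human consistency vs. human-LLM agreement, as in the abstract.
print(f"human-human agreement: {raw_agreement(human_1, human_2):.0%}")  # 80%
print(f"human-LLM agreement:   {raw_agreement(human_1, llm):.0%}")      # 40%
```

Raw agreement is the simplest such metric; unlike chance-corrected alternatives (e.g., Cohen's kappa), it does not account for agreement expected by chance, which matters when comparing languages with different label distributions.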