Abstract: This study defines sensitive prompts as those likely to trigger refusals, warnings, or guarded responses due to violations of a model's content policies. We propose a three-level typology of sensitive prompts: unacceptable, high, and low sensitivity. Using a dataset of 239 real-world human–ChatGPT interactions labeled as sensitive by ChatGPT, we evaluate seven LLMs from the U.S., China, and Europe in both English and Chinese. Our cross-lingual, cross-model analysis reveals divergent moderation behaviors: U.S.-based models exhibit stronger consistency and higher refusal rates, whereas Chinese models demonstrate more language-dependent behavior and moderation asymmetries. Furthermore, we uncover misalignments between classification and response behavior in several models, raising concerns about the transparency and reliability of moderation mechanisms. Our findings offer empirical insights for content moderation and future AI safety design.
Paper Type: Short
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: ethical considerations in NLP applications, policy and governance
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English, Chinese
Submission Number: 1907