Abstract: This study defines sensitive prompts as those likely to trigger refusals, warnings, or guarded responses due to violations of a model's content policies. We propose a three-level typology of sensitive prompts: unacceptable, high, and low sensitivity. Using a dataset of 239 real-world human–ChatGPT interactions labeled as sensitive by ChatGPT, we evaluate seven LLMs from the U.S., China, and Europe in both English and Chinese. Our cross-lingual, cross-model analysis reveals divergent moderation behaviors: U.S.-based models exhibit stronger consistency and higher refusal rates, whereas Chinese models demonstrate more language-dependent behavior and moderation asymmetries. Furthermore, we uncover misalignments between classification and response behavior in several models, raising concerns about the transparency and reliability of moderation mechanisms. Our findings offer empirical insights for content moderation and future AI safety design.
Paper Type: Short
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: ethical considerations in NLP applications, policy and governance
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English, Chinese
Submission Number: 1907