Covert Antagonistic Language in Large Language Models: Definition, Evaluation, and Capability-Dependent Emergence
Keywords: Covert antagonistic language, sarcasm, pragmatic safety, large language models, alignment, safety evaluation, human-AI interaction, user trust, theory of mind, conversational agents
Abstract: Large Language Models (LLMs) are increasingly deployed as conversational agents, yet safety evaluation remains focused on overt toxicity. We investigate a complementary pragmatic phenomenon: Covert Antagonistic Language (CAL), where models express hostility through indirect mechanisms, such as sarcasm and condescension, while maintaining surface-level politeness. Building on Gricean pragmatics, we propose an operational definition and a multi-dimensional evaluation framework for CAL. We apply this framework to 15 prominent LLMs using controlled prompts. Our analysis yields three key findings: (i) CAL constitutes a coherent construct measurable via sarcasm, indirectness, and deceptive intent; (ii) CAL expression is capability-dependent, remaining near floor level in cooperative contexts but scaling positively with general model capability (Elo score) under adversarial prompting; and (iii) a within-subjects user study ($N=20$) reveals that high-CAL outputs significantly erode trust and increase cognitive effort compared to low-CAL baselines. Finally, an exploratory analysis suggests that while instruction tuning may sharpen latent antagonistic capabilities, current safety tuning often results in pragmatically uncooperative refusals rather than constructive resolution. Our findings frame CAL as a sophisticated pragmatic alignment failure not captured by standard toxicity benchmarks.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness, Human-centered NLP, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data analysis, Theory
Languages Studied: English
Submission Number: 2827