Covert Antagonistic Language in Large Language Models: Definition, Evaluation, and Capability-Dependent Emergence
Keywords: Covert antagonistic language, sarcasm, pragmatic safety, large language models, alignment, safety evaluation, human-AI interaction, user trust, theory of mind, conversational agents
Abstract: Large Language Models (LLMs) are increasingly deployed as conversational agents, yet safety evaluation remains focused on overt toxicity. We investigate a complementary pragmatic phenomenon: Covert Antagonistic Language (CAL), where models express hostility through indirect mechanisms, such as sarcasm and condescension, while maintaining surface-level politeness. Building on Gricean pragmatics, we propose an operational definition and a multi-dimensional evaluation framework for CAL. We apply this framework to 15 prominent LLMs using controlled prompts. Our analysis yields three key findings: (i) CAL constitutes a coherent construct measurable via sarcasm, indirectness, and deceptive intent; (ii) CAL expression is capability-dependent, remaining near floor level in cooperative contexts but scaling positively with general model capability (Elo score) under adversarial prompting; and (iii) a within-subjects user study ($N=20$) reveals that high-CAL outputs significantly erode trust and increase cognitive effort compared to low-CAL baselines. Finally, an exploratory analysis suggests that while instruction tuning may sharpen latent antagonistic capabilities, current safety tuning often results in pragmatically uncooperative refusals rather than constructive resolution. Our findings frame CAL as a sophisticated pragmatic alignment failure not captured by standard toxicity benchmarks.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness, Human-centered NLP, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data analysis, Theory
Languages Studied: English
Submission Number: 2827