Truths, Lies, and Nudge. HAUNT: A Framework to Probe LLMs’ Self-consistency in Closed Domains Via Adversarial Nudge

Truths, Lies, and Nudge. HAUNT: A Framework to Probe LLMs’ Self-consistency in Closed Domains Via Adversarial Nudge

ACL ARR 2026 January Submission10225 Authors

06 Jan 2026 (modified: 20 Mar 2026)ACL ARR 2026 January SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Language Models, Hallucination, Sycophancy

Abstract: Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudge. Our framework consists of three steps. First, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. Next, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. Finally, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary and six open LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: even among the strongest proprietary LLMs, $\texttt{Claude}$ exhibits strong resilience, $\texttt{GPT}$ and $\texttt{Grok}$ demonstrate moderate resilience, while $\texttt{Gemini}$ and $\texttt{DeepSeek}$ show weak resilience and open models fall short significantly.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: Language Models, AI/LLM Agents

Contribution Types: Model analysis & interpretability, Data resources

Languages Studied: English

Submission Number: 10225

Loading