HalluTrigger: Triggering Input-Conflicting Hallucinations in Large Language Models with Semantic-guided Metamorphic Relations

ACL ARR 2025 May Submission768 Authors

15 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: While Large Language Models (LLMs) have developed rapidly across a range of natural language processing (NLP) tasks, they have also raised concerns about security and reliability. One such concern is Input-Conflicting Hallucination (ICH), a type of hallucination in which the model's output conflicts with the user input. Since data annotation in NLP tasks is expensive and labor-intensive, existing ICH attack methods adopt Metamorphic Testing to bypass the oracle problem. However, these attacks are black-box methods that are restricted to question answering tasks, rely on only a few Metamorphic Relations (MRs), and are easily defended against by decoder-only LLMs. In response, we propose \textsc{Hallu-TRIG}, a simple yet effective grey-box method that constructs six semantic-guided MRs to generate attack cases and uses a diversity-guided test case prioritization method to improve efficiency. We evaluate \textsc{Hallu-TRIG} on four NLP datasets and three widely used target LLMs. The designed MRs achieve higher hallucination trigger rates than existing state-of-the-art baselines, and the diversity-guided prioritization triggers ICHs in less time.
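The abstract describes the approach only at a high level. As a rough illustration of how a metamorphic relation can sidestep the oracle problem for ICH testing, the sketch below checks whether an LLM's answers to a question and to a semantics-preserving paraphrase of it agree, without needing labeled ground truth. All names here (`query_llm`, `paraphrase`, `violates_consistency_mr`) are hypothetical placeholders, and this is not one of the paper's six semantic-guided MRs.

```python
# Illustrative sketch only (hypothetical names, not the paper's actual MRs):
# a consistency-style metamorphic relation for detecting input-conflicting
# hallucinations without ground-truth labels.
from typing import Callable


def violates_consistency_mr(
    query_llm: Callable[[str], str],   # caller-supplied interface to the target LLM
    paraphrase: Callable[[str], str],  # semantics-preserving rewrite of the question
    context: str,
    question: str,
) -> bool:
    """Return True if the LLM's answers to a question and its paraphrase,
    given the same input context, disagree -- a signal that at least one
    answer conflicts with the input."""
    original = query_llm(f"{context}\n\nQuestion: {question}")
    transformed = query_llm(f"{context}\n\nQuestion: {paraphrase(question)}")
    return original.strip().lower() != transformed.strip().lower()
```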
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Metamorphic Testing, Large Language Model, Input-Conflicting Hallucination
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 768