Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial Attacks

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: LLM, Large Language Model, Hallucination, Adversarial Attack, Question Answering
Abstract: Although remarkable progress has been made in preventing LLM hallucinations through instruction tuning and retrieval augmentation, it remains difficult to measure the reliability of LLMs using available static data, which is often not challenging enough and may suffer from data leakage. Inspired by adversarial machine learning, this paper aims to develop an automatic method for generating new evaluation data by appropriately modifying existing data on which LLMs behave faithfully. Specifically, this paper presents AutoDebug, an LLM-based framework that uses prompt chaining to generate transferable adversarial attacks in the form of question-answering examples, and we seek to understand the extent to which these examples trigger hallucination in LLMs. We first implement our framework with ChatGPT and evaluate the resulting two variants of a popular open-domain question-answering dataset, Natural Questions (NQ), on a collection of open-source and proprietary LLMs under various prompting settings. Our generated evaluation data is human-readable and, as we show, humans can answer the modified questions well. Nevertheless, we observe pronounced accuracy drops across multiple LLMs, including GPT-4. Our experimental results confirm that LLMs are likely to hallucinate in two categories of question-answering scenarios: (1) when the knowledge given in the prompt conflicts with their parametric knowledge, or (2) when the knowledge expressed in the prompt is complex. Finally, the adversarial examples generated by the proposed method transfer across all considered LLMs, making our approach viable for LLM-based debugging with more cost-effective LLMs.
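To make the idea of prompt chaining for adversarial example generation concrete, the sketch below shows what a pipeline for the "knowledge conflict" category might look like. It is a minimal illustration based only on the abstract: the chain steps, prompt wording, and the `call_llm` helper are hypothetical assumptions, not the authors' actual AutoDebug implementation or prompts.

```python
# Hypothetical sketch of a prompt-chaining pipeline for producing a
# knowledge-conflict adversarial QA example, as described at a high
# level in the abstract. The steps, prompt text, and call_llm helper
# are illustrative assumptions, not the authors' implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an attacker LLM
    (e.g., ChatGPT). Plug in your provider's client here."""
    raise NotImplementedError("connect an LLM client")


def make_knowledge_conflict_example(question: str, passage: str, answer: str) -> dict:
    # Step 1: ask the attacker LLM for a plausible alternative answer
    # that conflicts with the target model's likely parametric knowledge.
    alt_answer = call_llm(
        f"Question: {question}\nOriginal answer: {answer}\n"
        "Propose a different but plausible answer of the same type. "
        "Reply with the answer only."
    ).strip()

    # Step 2: minimally rewrite the evidence passage so it supports the
    # new answer while remaining fluent and human-readable.
    edited_passage = call_llm(
        f"Passage: {passage}\n"
        f"Minimally edit the passage so that it supports the answer "
        f"'{alt_answer}' to the question '{question}'. "
        "Reply with the edited passage only."
    ).strip()

    # Step 3: lightweight self-check that the edited passage actually
    # entails the new answer; discard the candidate otherwise.
    verdict = call_llm(
        f"Passage: {edited_passage}\nQuestion: {question}\n"
        f"Does the passage support the answer '{alt_answer}'? Answer yes or no."
    ).strip().lower()

    if not verdict.startswith("yes"):
        return {}

    # The target LLM is then evaluated on (question, edited_passage);
    # answering from parametric knowledge instead of alt_answer would
    # count as a hallucination under the knowledge-conflict setting.
    return {"question": question, "context": edited_passage, "answer": alt_answer}
```

Because each step is a plain text-in/text-out LLM call, examples generated this way with a cheaper attacker model can be reused to probe stronger targets, which is consistent with the transferability claim in the abstract.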
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6876