What Text Do Language Models Trust?

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Language Models, Question Answering, LLM Agents, Alignment
Abstract: Large language models (LLMs) are being tasked with increasingly open-ended, delicate, and subjective tasks. In particular, retrieval-augmented models can now answer subjective questions (e.g., "is aspartame linked to cancer?"), and in doing so they condition on text drawn from arbitrary websites, whose evidence may conflict. Humans face these same conflicts, and to reach an answer they critically evaluate the arguments, trustworthiness, and credibility of each source. In this work, we study what types of evidence current LLMs find convincing, and whether their judgements align with human preferences. Specifically, we construct ConflictingQA, a new benchmark that pairs controversial questions with a series of evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (yes or no). We first find that models are highly corrigible: they update their predictions when given novel contexts, even when those contexts conflict with their prior knowledge. However, the types of evidence that models find convincing do not align well with human preferences.
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4717