Submission Type: Regular Short Paper
Submission Track: Theme Track: Large Language Models and the Future of NLP
Submission Track 2: Question Answering
Keywords: question answering, hallucination, evidence retrieval, dataset creation, generative language models, chatgpt
TL;DR: We present a challenging task and dataset for investigating ChatGPT's hallucination problem: given a text document and a non-factoid question, recognize whether the text provides an answer to the question or not.
Abstract: Generative language models have recently shown remarkable success in generating answers to questions in a given textual context.
However, these answers may suffer from hallucination, cite evidence incorrectly, and spread misleading information.
In this work, we address this problem by employing ChatGPT, a state-of-the-art generative model, as a machine-reading system.
We ask it to retrieve answers to lexically varied and open-ended questions from trustworthy instructive texts.
We introduce WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation benchmark consisting of a set of WikiHow articles exhaustively annotated with evidence sentences for questions. The dataset comes with a special challenge: all questions are about the article's topic, but not all of them can be answered using the provided context.
Interestingly, we find that when using a regular question-answering prompt, ChatGPT fails to detect the unanswerable cases.
When provided with a few examples, it learns to better judge whether a text provides evidence for an answer or not.
Alongside this important finding, our dataset defines a new benchmark for evidence retrieval in question answering, which we argue is one of the necessary next steps for making large language models more trustworthy.
Submission Number: 3191