Submission Type: Regular Short Paper
Submission Track: Theme Track: Large Language Models and the Future of NLP
Submission Track 2: Question Answering
Keywords: question answering, hallucination, evidence retrieval, dataset creation, generative language models, chatgpt
TL;DR: We present a challenging task and dataset for investigating ChatGPT's hallucination problem: given a text document and a non-factoid question, recognize whether the text provides an answer to the question or not.
Abstract: Generative language models have recently shown remarkable success in generating answers to questions in a given textual context.
However, these answers may suffer from hallucination, cite evidence incorrectly, and spread misleading information.
In this work, we address this problem by employing ChatGPT, a state-of-the-art generative model, as a machine-reading system.
We ask it to retrieve answers to lexically varied and open-ended questions from trustworthy instructive texts.
We introduce WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation benchmark consisting of a set of WikiHow articles exhaustively annotated with evidence sentences for questions. The dataset comes with a special challenge: all questions are about the article's topic, but not all of them can be answered using the provided context.
Interestingly, we find that when using a regular question-answering prompt, ChatGPT fails to detect the unanswerable cases.
When provided with a few examples, it learns to better judge whether a text provides evidence for an answer or not.
Alongside this important finding, our dataset defines a new benchmark for evidence retrieval in question answering, which we argue is one of the necessary next steps for making large language models more trustworthy.
Submission Number: 3191