Interview Evaluation: A Novel Approach for Automatic Evaluation of Conversational Question Answering Models

Xibo Li; Bowei Zou; Yifan Fan; Yanling Li; AiTi Aw; Yu Hong

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Question Answering

Submission Track 2: Resources and Evaluation

Keywords: Conversational Question Answering, Evaluation metrics, Conversational history, Conversational question generation, Prompting

Abstract: Conversational Question Answering (CQA) aims to provide natural language answers to users in information-seeking dialogues. Existing CQA benchmarks often evaluate models using pre-collected human-human conversations. However, replacing the model-predicted dialogue history with ground truth compromises the naturalness and sustainability of CQA evaluation. Previous studies proposed using predicted history together with rewriting techniques to address unresolved coreferences and incoherencies, but this approach makes each question self-contained and detaches it from the conversation. In this paper, we propose a novel automatic evaluation approach, interview evaluation. Specifically, ChatGPT acts as the interviewer (Q agent) with a set of carefully designed prompts, and the CQA model under test serves as the interviewee (A agent). During the interview evaluation, the Q agent dynamically generates questions to guide the A agent toward the correct answer through an interactive process. In our experiments, we evaluated four models on QuAC and two models on CoQA. The experimental results demonstrate that interview evaluation has advantages over previous CQA evaluation approaches, particularly in terms of naturalness and coherence. The source code is publicly available.
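
The interaction described in the abstract can be pictured as a simple loop between the two agents. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `run_interview`, `toy_ask`, `toy_answer`, and the fixed `max_turns` budget are all hypothetical names and choices, and the actual prompts, stopping rule, and scoring are not given on this page. The one property it tries to show is that the history passed to both agents holds the model's own predicted answers rather than ground-truth answers.

```python
from typing import Callable, List, Tuple

# One dialogue turn: (question from the Q agent, answer predicted by the A agent).
History = List[Tuple[str, str]]


def run_interview(
    passage: str,
    ask: Callable[[str, History], str],
    answer: Callable[[str, str, History], str],
    max_turns: int = 3,  # assumption: a fixed turn budget, purely for illustration
) -> History:
    """Alternate between a question generator and the model under test."""
    history: History = []
    for _ in range(max_turns):
        # The Q agent sees the passage and the predicted history so far.
        question = ask(passage, history)
        # The A agent answers given the same predicted history.
        predicted = answer(passage, question, history)
        # Store the predicted answer, not a gold answer.
        history.append((question, predicted))
    return history


# Hypothetical stand-ins so the sketch runs without any external service.
def toy_ask(passage: str, history: History) -> str:
    return f"Question {len(history) + 1} about the passage?"


def toy_answer(passage: str, question: str, history: History) -> str:
    # Trivial baseline: always return the first sentence of the passage.
    return passage.split(".")[0]


transcript = run_interview(
    "QuAC is a dataset. It has dialogues.", toy_ask, toy_answer, max_turns=2
)
for q, a in transcript:
    print(q, "->", a)
```

In a real setup, `ask` would wrap a prompted language model and `answer` would wrap the CQA model being evaluated; both are left as plain callables here so that no specific API is assumed.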

Submission Number: 1810
