FRAPI: A Framework for Generating Competitive Test Beds for Task-Oriented Dialog Systems

Anonymous

04 Mar 2022 (modified: 05 May 2023) · Submitted to NLP for ConvAI · Readers: Everyone
TL;DR: Empirically, we highlight the simplicity of current test sets for evaluating dialog-state-tracking models and propose a framework for creating richer, more diverse test sets.
Abstract: Current test sets for task-oriented dialog systems tend to overestimate the systems' performance on conversation-level tasks such as dialog state tracking. We observe that these systems fail to show similar efficacy when tested on commonly occurring realistic scenarios such as repetition and clarification within a dialogue. This limited generalizability can be attributed to two key factors. First, the crowd-workers who create these test sets follow a highly restrictive dialog policy when generating samples, which leads to rigid and less realistic dialogues. Second, the train and test splits are plagued by annotator biases, since the same set of crowd-workers is recruited to create both splits. Using a graphical framework for dialogues called Conversation Flow Modeling, we highlight the limitations of one such dataset. While motivating practitioners to create stricter test sets, we propose the Framework for Automated Pattern Induction (FRAPI), a human-computer-interaction (HCI) framework for inducing additional natural dialog flows. FRAPI helps create annotator-bias-free patterns in test beds for task-oriented dialog systems with minimal human intervention. Using FRAPI, we build a testbed for models trained on the MultiWOZ dataset. The proposed testbed validates learning from diverse yet natural patterns, and through it we highlight the shortcomings of current architectures in modeling simple, realistic human-level language variations for dialog state tracking.
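
As a concrete illustration of the kind of perturbation such a testbed contains, the minimal Python sketch below injects a clarification-and-repetition exchange into a MultiWOZ-style dialogue while leaving the gold dialog state unchanged. The turn format and the function name inject_clarification are assumptions made for this illustration, not the FRAPI implementation.

    import copy

    def inject_clarification(dialogue, turn_idx):
        """Insert a system clarification request and a user repetition after
        the user turn at `turn_idx`. The gold dialog state is unchanged, so a
        robust DST model should predict the same state after the detour."""
        new_dialogue = copy.deepcopy(dialogue)
        user_turn = new_dialogue[turn_idx]
        clarification = {
            "speaker": "system",
            "utterance": "Sorry, could you repeat that?",
            # The clarification carries no new constraints; state is untouched.
            "state": user_turn["state"],
        }
        repetition = {
            "speaker": "user",
            "utterance": user_turn["utterance"],  # user restates the request
            "state": user_turn["state"],          # same gold annotation
        }
        # Splice the two extra turns in immediately after the original turn.
        new_dialogue[turn_idx + 1:turn_idx + 1] = [clarification, repetition]
        return new_dialogue

    if __name__ == "__main__":
        dialogue = [
            {"speaker": "user",
             "utterance": "I need a cheap hotel in the north.",
             "state": {"hotel-pricerange": "cheap", "hotel-area": "north"}},
        ]
        for turn in inject_clarification(dialogue, turn_idx=0):
            print(turn["speaker"], "->", turn["utterance"])

A model that truly tracks dialog state should be invariant to such a detour; the abstract's finding is that current architectures are not.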