Abstract: There is an increasing trend toward using neural methods for dialogue model evaluation. Without a framework for investigating these metrics, dialogue models may inherit their biases, causing unforeseen problems during interactions. In this work, we propose an adversarial test suite that uses automatic heuristics to generate problematic variations of various dialogue aspects, e.g. logical entailment. By analyzing their assessments of these problematic examples, we show that dialogue metrics for both open-domain and task-oriented settings are biased in their assessments of different conversation behaviors and fail to properly penalize problematic conversations. We conclude that variability in training methodologies and data-induced biases are among the main causes of these problems. We also investigate metric behavior using a black-box interpretability model, which corroborates our findings and provides evidence that the metrics attend to the problematic conversational constructs, signaling a misunderstanding of conversation semantics.
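To make the idea of "automatic heuristics that generate problematic variations" concrete, here is a minimal, hypothetical sketch (not the paper's actual tooling) of one such heuristic: perturbing the final turn of a conversation so that it contradicts itself, yielding a variant that a well-behaved dialogue metric should penalize relative to the original. The function name, negation table, and example conversation are illustrative assumptions.

```python
# Hypothetical sketch of an automatic heuristic for building adversarial
# test cases: given a well-formed conversation, produce a "problematic"
# variant (here, a self-contradicting final turn) that a good dialogue
# metric should score lower than the original.

NEGATIONS = {
    "do": "don't", "is": "isn't", "can": "can't",
    "will": "won't", "like": "don't like", "love": "don't love",
}

def contradict_last_turn(conversation):
    """Return a copy of the conversation whose final turn contradicts itself
    by naively negating the first negatable verb it contains."""
    perturbed = list(conversation)
    words = perturbed[-1].split()
    for i, word in enumerate(words):
        if word.lower() in NEGATIONS:
            words[i] = NEGATIONS[word.lower()]
            break
    perturbed[-1] = perturbed[-1] + " Actually, " + " ".join(words).lower()
    return perturbed

# Usage: score both versions with any automatic dialogue metric; a metric
# sensitive to logical entailment should penalize the adversarial variant.
original = ["Do you like jazz?", "I love jazz, I listen to it every day."]
adversarial = contradict_last_turn(original)
print(adversarial[-1])
```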