Abstract: Neural models that extend the pretrain-then-finetune paradigm continue to set new state-of-the-art results on dialogue state tracking (DST) benchmarks as measured by joint goal accuracy (JGA). However, motivated by CheckList (Ribeiro et al., 2020), we argue for a holistic assessment of DST models, since JGA cannot capture robustness to the inevitable test-time distribution shifts. To this end, we build on recent work on robustness testing in task-oriented dialogue and introduce CheckDST, an instantiation of CheckList for DST that quantifies robustness with test set augmentations and new metrics that measure consistency. Using CheckDST, we extensively compare state-of-the-art DST models. First, we find that, although span-based classification models achieve slightly higher JGA on the original test set than generation models, they are significantly less robust to distribution shift. Second, we observe that stopping training early, e.g., at the first epoch, hurts JGA but yields models that are significantly more robust to distribution shift. Finally, guided by the weaknesses exposed by CheckDST, we explore training DST models that simultaneously boost JGA and CheckDST metrics, and report preliminary success with PrefineDST, a simple generation model pretrained on non-target datasets to internalize reasoning skills relevant to dialogue state tracking.
Paper Type: long
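As a rough illustration of the kind of consistency metric the abstract describes, the sketch below scores how often a model's predicted dialogue state is unchanged between an original test turn and its augmented (e.g., paraphrased) counterpart. The function name, the slot-value representation of a dialogue state, and the exact-match definition are illustrative assumptions for this sketch, not the metric definitions from the paper.

```python
from typing import Dict, List

# Illustrative assumption: a dialogue state is a mapping of slot -> value,
# e.g. {"hotel-area": "north", "train-day": "friday"}.
DialogueState = Dict[str, str]

def consistency(
    original_preds: List[DialogueState],
    augmented_preds: List[DialogueState],
) -> float:
    """Fraction of turns whose predicted state is identical on the
    original input and on its augmented (e.g. paraphrased) version."""
    assert len(original_preds) == len(augmented_preds)
    unchanged = sum(o == a for o, a in zip(original_preds, augmented_preds))
    return unchanged / len(original_preds)

# A paraphrase flips one of two predictions, so consistency here is 0.5.
orig = [{"hotel-area": "north"}, {"train-day": "friday"}]
aug = [{"hotel-area": "north"}, {"train-day": "monday"}]
print(f"consistency = {consistency(orig, aug):.2f}")  # consistency = 0.50
```

A robust model would keep its predicted state under meaning-preserving perturbations, so values near 1.0 indicate robustness even when JGA on the original test set is unchanged.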