Precise Task Formalization Matters in Winograd Schema Evaluations

Haokun Liu, William Huang, Dhara Mungra, Samuel R. Bowman

05 Jun 2020 (modified: 09 Oct 2020) | OpenReview Anonymous Preprint Blind Submission | Readers: Everyone

Keywords: NLP, WSC, commonsense reasoning, evaluation

Abstract: Performance on the Winograd Schema Challenge (WSC), a respected commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that part of this improvement comes from recent changes in task formalization by users of the dataset, where task formalization means the combination of input specification, loss function, and the way pretrained parameters are used. We perform an ablation on two Winograd Schema datasets that interpolates between the two formalizations used before and after this surge, and find that (i) framing the task as multiple choice improves performance dramatically and (ii) several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters. The impact of task formalization may result in overly optimistic reports of improved commonsense reasoning performance. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.
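
To make the loss-function part of "task formalization" concrete, here is a minimal sketch contrasting two ways of scoring the same candidate referents. It is not the authors' implementation: the function names, the example scores, and the choice of a pointwise binary classification loss as the contrast to multiple choice are all assumptions made for illustration, since the abstract does not spell out the earlier formalization.

```python
import math


def pointwise_loss(scores, labels):
    """Binary cross-entropy: each candidate is judged independently.

    Assumed stand-in for a non-multiple-choice formalization.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        prob = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the raw score
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(scores)


def multiple_choice_loss(scores, gold_index):
    """Softmax cross-entropy: candidates compete for probability mass."""
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[gold_index]


def multiple_choice_predict(scores):
    """Pick the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical model scores for two candidate referents of one pronoun,
# e.g. "the trophy" vs. "the suitcase" in a classic Winograd schema.
example_scores = [1.2, -0.3]
example_labels = [1, 0]  # first candidate is the correct referent

if __name__ == "__main__":
    print("pointwise loss:", pointwise_loss(example_scores, example_labels))
    print("multiple-choice loss:", multiple_choice_loss(example_scores, 0))
    print("predicted candidate:", multiple_choice_predict(example_scores))
```

Under the multiple-choice framing only the relative order of the candidate scores decides the prediction, whereas the pointwise framing also requires each score to fall on the correct side of a fixed threshold.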
