Abstract: Previous work on probing word representa- tions for linguistic knowledge has focused on interpolation tasks. In this paper, we in- stead analyse probes in an extrapolation set- ting, where the inputs at test time are deliber- ately chosen to be ‘harder’ than the training examples. We argue that such an analysis can shed further light on the open question whether probes actually decode linguistic knowledge, or merely learn the diagnostic task from shal- low features. To quantify the hardness of an example, we consider scoring functions based on linguistic, statistical, and learning-related criteria, all of which are applicable to a broad range of NLP tasks. We discuss the relative merits of these criteria in the context of two syntactic probing tasks, part-of-speech tagging and syntactic dependency labelling. From our theoretical and experimental analysis, we con- clude that distance-based and hard statistical criteria show the clearest differences between interpolation and extrapolation settings, while at the same time being transparent, intuitive, and easy to control.
0 Replies
Loading