Quantifying the Task-Specific Information in Text-Based Classifications

Anonymous

17 Aug 2021 (modified: 05 May 2023) · ACL ARR 2021 August Blind Submission
Abstract: Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but this high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues, the ``shortcuts'' inherent in the datasets, do not contribute to the task-specific information (TSI) of the classification tasks. While it is essential to evaluate model performance, it is equally important to understand the datasets themselves. In this paper, we ask: apart from the information introduced by shortcut features, how much task-specific information is required to classify a dataset? We formulate this quantity in an information-theoretic framework. While this quantity is hard to compute exactly, we approximate it with a fast and stable method. TSI quantifies the amount of linguistic knowledge -- modulo a set of predefined shortcuts -- that contributes to classifying a sample from each dataset. This framework allows us to compare across datasets, saying that, apart from a set of ``shortcut features'', classifying a Multi-NLI sample involves around 0.4 nats more TSI than classifying a Quora Question Pairs sample.
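The abstract reports TSI in nats, i.e., as a difference in information measured with natural logarithms. As a minimal illustrative sketch (not the paper's actual estimator), one common way to operationalize such a quantity is as the reduction in average negative log-likelihood when moving from a shortcut-only classifier to one that sees the full input; all probabilities below are toy numbers invented for illustration:

```python
import math

def avg_nll(probs_for_true_label):
    """Average negative log-likelihood in nats (natural log)."""
    return -sum(math.log(p) for p in probs_for_true_label) / len(probs_for_true_label)

# Toy predicted probabilities assigned to the true label for four examples.
# These values are hypothetical, not taken from the paper.
shortcut_model = [0.50, 0.55, 0.60, 0.50]   # classifier using shortcut features only
full_model     = [0.85, 0.90, 0.80, 0.88]   # classifier using the full text

# TSI-style estimate: information the full input carries beyond the shortcuts.
tsi_estimate = avg_nll(shortcut_model) - avg_nll(full_model)
print(f"Estimated task-specific information: {tsi_estimate:.2f} nats")
```

A gap of roughly 0.4-0.5 nats in this toy setup mirrors the scale of the Multi-NLI vs. Quora Question Pairs comparison quoted in the abstract, though the paper's actual formulation and approximation method are more involved.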