TextDescriptives: A Python package for calculating a large variety of metrics from text

Lasse Hansen, Ludvig Renbo Olsen, Kenneth Enevoldsen

07 Jun 2023, OpenReview Archive Direct Upload, Readers: Everyone

Abstract: Natural language processing (NLP) tasks often require a thorough understanding and description of the corpus. Document-level metrics can be used to identify low-quality data, assess outliers, or understand differences between groups. Further, text metrics have long been used in fields such as the digital humanities, where, for example, measures of text complexity are commonly used to analyse, understand, and compare text corpora. However, extracting complex metrics is error-prone, and research implementations are rarely rigorously tested. This can lead to subtle differences between implementations and reduce the reproducibility of scientific results. TextDescriptives offers a simple and modular approach to extracting both simple and complex metrics from text. It achieves this by building on the spaCy framework (Honnibal et al., 2020). This means that TextDescriptives can easily be integrated into existing workflows while leveraging the efficiency and robustness of the spaCy library. The package has already been used for analysing the linguistic stability of clinical texts (Hansen et al., 2022), creating features for predicting neuropsychiatric conditions (Hansen et al., 2023), and analysing linguistic goals of primary school students (Tannert, 2023).
