Abstract: Natural language processing (NLP) tasks often require a thorough understanding and description
of the corpus. Document-level metrics can be used to identify low-quality data, assess outliers,
or understand differences between groups. Text metrics have also long been used in fields
such as the digital humanities, where measures of text complexity are commonly used to
analyse, understand, and compare text corpora. However, extracting complex metrics can be
an error-prone process and is rarely rigorously tested in research implementations. This can
lead to subtle differences between implementations and reduces the reproducibility of scientific
results.
TextDescriptives offers a straightforward, modular approach to extracting both simple and complex
metrics from text. It achieves this by building on the spaCy framework (Honnibal et al., 2020).
This means that TextDescriptives can easily be integrated into existing workflows while
leveraging the efficiency and robustness of the spaCy library. The package has already been
used for analysing the linguistic stability of clinical texts (Hansen et al., 2022), creating features
for predicting neuropsychiatric conditions (Hansen et al., 2023), and analysing linguistic goals
of primary school students (Tannert, 2023).
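
To make the idea of a document-level metric concrete, the following minimal Python sketch computes a few simple descriptive statistics with the standard library only. It is not the TextDescriptives implementation and does not use spaCy; the function name `describe`, the regex-based tokenisation, and the example texts are assumptions made purely for illustration.

```python
import re
from statistics import mean


def describe(text: str) -> dict:
    """Compute a few simple document-level metrics for one text (illustrative only)."""
    # Naive tokenisation (assumption): every run of word characters is one token.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        # Empty documents get zeroed metrics instead of raising on division by zero.
        return {
            "n_tokens": 0,
            "n_unique_tokens": 0,
            "mean_token_length": 0.0,
            "type_token_ratio": 0.0,
        }
    unique = set(tokens)
    return {
        "n_tokens": len(tokens),
        "n_unique_tokens": len(unique),
        "mean_token_length": mean(len(t) for t in tokens),
        # A low type-token ratio indicates heavy repetition within the document.
        "type_token_ratio": len(unique) / len(tokens),
    }


# Hypothetical example corpus: one ordinary sentence and one highly repetitive text.
docs = ["The cat sat on the mat.", "buy buy buy buy buy"]
metrics = [describe(d) for d in docs]
```

Comparing such values across a corpus is one way to flag outliers: the second example text has a much lower type-token ratio than the first, which is typical of repetitive, low-quality documents.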