Abstract: Open-access text corpora are crucial for advancing research in natural language processing (NLP) and computational social science (CSS). Despite the growing availability of datasets, resources for languages other than English, such as German, remain scarce. This limits large-scale studies of linguistic, cultural, and societal trends and hinders research on complex issues like gender bias and discrimination. To address this gap, we present \texttt{taz2024full}, to our knowledge the largest publicly available dataset of German newspaper articles to date, comprising over 1.8 million articles from the German newspaper "taz" spanning 1980 to 2024. Including other sources in the corpus was not possible, as no other German newspaper provided free access to its data or allowed the publication of such a dataset. While access could have been obtained through paid licensing, this would not have guaranteed full data availability, and legal restrictions would have prohibited the release of the corpus for public use. As a result, "taz" remains the sole source for this dataset.
To demonstrate the potential of the corpus for bias and discrimination research, we analyse how references to different genders have evolved over more than four decades of reporting. Our findings reveal a persistent imbalance, with men consistently appearing more frequently in articles and receiving more textual space. However, we also observe a gradual shift towards a more balanced representation of genders in recent years. By adapting and scaling an existing pipeline for detecting gender bias and discrimination in news media, we provide researchers with a structured approach to studying actor representation, sentiment, and linguistic framing in German journalistic texts.
The taz2024full corpus and its accompanying pipeline support a wide range of research applications, from studying language evolution to investigating media bias and discrimination. By making this resource publicly available and demonstrating its application, we aim to facilitate interdisciplinary research, foster inclusivity in language technologies, and contribute to a more informed selection of training data for NLP models.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, language resources, NLP datasets, automatic evaluation of datasets, evaluation methodologies, evaluation, metrics
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: German
Submission Number: 861