NYTAC-CC: A Climate Change Subcorpus based on New York Times Articles

23 May 2024 (modified: 18 Jun 2024). Submitted to ClimateNLP 2024. License: CC BY 4.0

Keywords: Climate Change, Topic Modeling, NLP, Corpus

TL;DR: This paper presents a CC-specific subcorpus extracted from the New York Times Annotated Corpus of 1.8 million articles, marking the first CC analysis on this data.

Abstract: Over the past decade, the analysis of discourse on climate change (CC) has attracted growing interest in both the social sciences and the NLP community. Textual resources are crucial for understanding how narratives about this phenomenon are crafted and delivered. However, while social media resources receive growing attention, datasets that cover CC in news media in a representative way remain scarce. This paper presents a CC-specific subcorpus extracted from the New York Times Annotated Corpus of 1.8 million articles, marking the first CC analysis on this data. The subcorpus was created by combining different text selection methods to ensure its representativeness and reliability, and it is validated using ClimateBERT. To provide initial insights into the CC subcorpus, we discuss the results of a topic modeling experiment (LDA). The results show the diversity of contexts in which CC is discussed in news media over time, which is relevant for various downstream tasks.
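
The idea of combining several selection signals can be illustrated with a minimal sketch. It is not the paper's actual procedure, which the abstract does not specify: the term list, the `tags` field, the `min_hits` threshold, and the toy articles below are all hypothetical, chosen only to show why a single keyword match may be too weak a criterion on its own.

```python
import re

# Hypothetical term list; the selection criteria actually used are not given in the abstract.
CC_TERMS = ["climate change", "global warming", "greenhouse gas"]
CC_PATTERN = re.compile("|".join(re.escape(t) for t in CC_TERMS), re.IGNORECASE)


def keyword_hits(text):
    """Count occurrences of any CC term in the article body."""
    return len(CC_PATTERN.findall(text))


def select_article(article, min_hits=2):
    """Keep an article if an editorial tag matches a CC term,
    or if the body mentions CC terms at least `min_hits` times.

    `article` is a dict with a 'text' string and an optional 'tags' list
    (an assumed layout, not the corpus's real schema).
    """
    tag_match = any(CC_PATTERN.search(tag) for tag in article.get("tags", []))
    return tag_match or keyword_hits(article["text"]) >= min_hits


# Toy examples, invented for illustration only.
articles = [
    {"id": 1, "text": "Global warming and climate change dominated the summit.", "tags": []},
    {"id": 2, "text": "The mayor discussed the budget.", "tags": ["Global Warming"]},
    {"id": 3, "text": "A political climate change is coming, some say.", "tags": []},
]

selected_ids = [a["id"] for a in articles if select_article(a)]
print(selected_ids)
```

In this toy setup, the third article contains the phrase "climate change" once in a figurative sense and is excluded by the hit threshold, while the second is retained through its tag alone. A classifier-based check such as the ClimateBERT validation mentioned in the abstract would sit on top of a filter like this.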

Archival Submission: archival

Submission Number: 17
