- Keywords: stance detection, (dis)agreement detection, pre-trained language models, graph representation learning
- TL;DR: This paper presents a comment-reply dataset collected from Reddit that opens up opportunities to combine pre-trained language models and graph representation learning methods for (dis)agreement detection.
- Abstract: In this paper, we introduce DEBAGREEMENT, a dataset of 42,894 comment-reply pairs from the popular discussion website Reddit, annotated with agree, neutral, or disagree labels. We collect data from five forums on Reddit: r/BlackLivesMatter, r/Brexit, r/climate, r/democrats, and r/Republican. For each forum, we select comment pairs so that together they form a user interaction graph. DEBAGREEMENT presents a challenge for Natural Language Processing (NLP) systems because it contains the slang, sarcasm, and topic-specific jokes that are common in online exchanges. We evaluate the performance of state-of-the-art language models on a (dis)agreement detection task and investigate the use of the available contextual information (graph, authorship, and temporal information). Since recent research has shown that context, such as social context or knowledge graph information, helps language models perform better on downstream NLP tasks, DEBAGREEMENT provides novel opportunities for combining graph-based and text-based machine learning techniques to detect (dis)agreements online.
- Supplementary Material: pdf
- URL: https://scale.com/open-datasets/oxford
- Contribution Process Agreement: Yes
- Dataset Url: https://scale.com/open-datasets/oxford
- License: Creative Commons Attribution 4.0 International Public License (“CC BY 4.0”)
- Author Statement: Yes