Keywords: Rumor Detection, Data Augmentation, Social Media, Neural Language Models, Weak Supervision
TL;DR: We propose a method for augmenting publicly available rumor-detection data based on semantic relatedness between limited labeled data and unlabeled data.
Abstract: In this paper, we address the challenges of limited labeled data and class imbalance in machine learning-based rumor detection on social media. We present an offline data augmentation method based on semantic relatedness that exploits unlabeled social media data to augment limited labeled data. A context-aware neural language model and a large credibility-focused Twitter corpus are employed to learn effective representations of rumor tweets for semantic relatedness measurement. A language model fine-tuned on a large domain-specific corpus shows a dramatic improvement in training data augmentation for rumor detection over pretrained language models. We conduct experiments on six real-world events based on five publicly available data sets and one augmented data set. Our experiments show that the proposed method allows us to generate a larger training set of reasonable quality via weak supervision. We present preliminary results achieved using a state-of-the-art neural network model with augmented data for rumor detection.
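The general idea of relatedness-based weak labeling can be illustrated with a minimal sketch. It is not the paper's implementation: the toy 2-d vectors stand in for representations that would come from the fine-tuned language model, and the nearest-reference rule, the cosine measure, the `weak_label` helper, and the 0.8 threshold are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weak_label(labeled, unlabeled, threshold=0.8):
    """Assign each unlabeled vector the label of its most similar labeled
    reference, keeping it only if the similarity reaches the threshold.

    labeled:   list of (vector, label) pairs
    unlabeled: list of vectors
    returns:   list of (unlabeled_index, label, similarity)
    """
    selected = []
    for i, vec in enumerate(unlabeled):
        best_label, best_score = None, -1.0
        for ref_vec, label in labeled:
            score = cosine(vec, ref_vec)
            if score > best_score:
                best_label, best_score = label, score
        if best_score >= threshold:
            selected.append((i, best_label, best_score))
    return selected


# Toy embeddings (hypothetical); real ones would be tweet representations.
labeled = [([1.0, 0.0], "rumor"), ([0.0, 1.0], "non-rumor")]
unlabeled = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.95]]

augmented = weak_label(labeled, unlabeled, threshold=0.8)
for index, label, score in augmented:
    print(index, label, round(score, 3))
```

In this toy run the ambiguous middle vector falls below the threshold and is discarded, which shows how a relatedness cutoff trades the size of the augmented set against label noise.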