Abstract: The global impact of the COVID-19 pandemic calls for a comprehensive understanding of public sentiment and reactions. Numerous public COVID-19 datasets exist, some reaching volumes of up to 100 billion, but they suffer from a lack of labeled data and from coarse-grained or inappropriate sentiment labels. In this paper, we present SenWave, a novel fine-grained sentiment analysis dataset for COVID-19 tweets, covering ten fine-grained categories in five languages. The dataset includes 10,000 annotated English tweets, 10,000 annotated Arabic tweets, and 30,000 Spanish, French, and Italian tweets translated from the English tweets. It also contains more than 105 million unlabeled tweets from the first wave of COVID-19. To enable accurate fine-grained sentiment classification, we fine-tune pre-trained transformer-based language models on the labeled tweets. Beyond classification, we provide a detailed analysis of how the emotional landscape evolves over time across languages, countries, and topics. Furthermore, we evaluate the compatibility of our dataset with ChatGPT. Our dataset and code are publicly available via Anonymous GitHub\footnote{https://anonymous.4open.science/r/SenWave-73D0}. We anticipate that this work will encourage more fine-grained sentiment analysis of complex events within the NLP community.
Paper Type: long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Contribution Types: Data resources, Data analysis
Languages Studied: English, Arabic, Spanish, French, Italian