On the importance of pre-processing in small-scale analyses of twitter: a case study of the 2019 Indian general election

Priyavrat Chauhan; Nonita Sharma; Geeta Sikka

Published: 01 Jan 2024, Last Modified: 01 Aug 2025 | Multim. Tools Appl. 2024 | CC BY-SA 4.0

Abstract: The main purpose of this paper is to emphasize the role of data pre-processing in the sentiment analysis of Twitter data. The paper provides detailed analysis and methods for understanding and handling Twitter data when analyzing public views during elections. We argue that, to accurately assess public opinion towards a political party or leader, the analysis should focus on users' personal tweets rather than tweets from news or media sources. We also argue that emojis, punctuation, stopwords, emphasized words, and some specific regions (Unicode, #, @) inside tweets play a significant role in sentiment analysis. In view of this, the paper provides a novel set of pre-processing steps that filter and clean tweets without losing any vital information. For experimentation, we take a small case study comprising 258,891 instances related to the 2019 Indian General Election, collected from Twitter using #LoksabhaElection2019. A pre-trained sentiment analysis model, twitter-xlm-roberta-base-sentiment, is used to analyze the sentiment of public tweets. Results show that tweets from media sources and the specific regions of tweets inject data bias and affect the final sentiment analysis results. We found that only 40% of the collected tweets were useful for determining public sentiment for election analysis, while the rest were irrelevant media tweets. We also observed an increase in negative and neutral sentiment outputs due to the presence of media tweets and the specific regions. Further, an exploratory analysis examines public sentiment towards various political terms inferred using top2vec topic modeling.
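
As a rough illustration of the kind of filtering and cleaning the abstract describes, the sketch below drops tweets from media-like accounts and applies light cleaning that leaves hashtags, mentions, emojis, punctuation, and stopwords in place. The keyword list, the function names, the tweet dictionary layout, and the specific cleaning rules are all assumptions made for illustration; the abstract does not specify the paper's actual steps.

```python
import html
import re

# Hypothetical keyword list for spotting news/media accounts; the paper's
# actual filtering criteria are not given in the abstract.
MEDIA_KEYWORDS = ("news", "times", "tv", "media", "express")

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def is_media_account(screen_name: str) -> bool:
    """Guess whether an account is a media source from its screen name."""
    name = screen_name.lower()
    return any(keyword in name for keyword in MEDIA_KEYWORDS)


def clean_tweet(text: str) -> str:
    """Light cleaning that keeps hashtags, mentions, emojis, and punctuation."""
    text = html.unescape(text)   # decode HTML entities such as "&amp;"
    text = URL_RE.sub("", text)  # drop links (an assumed cleaning rule)
    return WHITESPACE_RE.sub(" ", text).strip()


def preprocess(tweets):
    """Keep tweets from non-media accounts only and clean their text."""
    return [
        clean_tweet(tweet["text"])
        for tweet in tweets
        if not is_media_account(tweet["user"])
    ]


# Made-up sample tweets, for illustration only.
sample = [
    {"user": "DailyNewsIndia",
     "text": "Polling begins in phase 1 https://t.co/abc"},
    {"user": "ravi_1990",
     "text": "Sooo proud to vote today!!! 😊 #LoksabhaElection2019 &amp; beyond https://t.co/xyz"},
]
cleaned = preprocess(sample)
print(cleaned)
```

The cleaned texts could then be passed to a sentiment classifier such as the twitter-xlm-roberta-base-sentiment model named in the abstract. A screen-name keyword match is only a crude stand-in for media filtering and would misclassify many accounts in practice.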
