Sentiment Classification using Sentence Embeddings: Exploiting Sentence Transformer Loss Functions

TMLR Paper 3385 Authors

24 Sept 2024 (modified: 26 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: Evaluating customer sentiment plays a critical role in business success. By analyzing customer feedback, companies can swiftly identify expectations, areas for improvement, and pain points related to their products and services. Sentiment analysis, fueled by advances in natural language processing techniques, has become widely accepted for this purpose. In this study, we leverage the popular “Twitter US Airline Sentiment” dataset to develop a sentence transformer architecture based on pre-trained transformer models (MPNet and RoBERTa-Large). We fine-tune the model using appropriate loss functions to generate semantically rich sentence embeddings that are subsequently fed into machine learning algorithms. The resulting hybrid models achieve impressive sentiment prediction performance. Additionally, this study delves into the intricacies of various transformer loss functions that can be applied to fine-tune the sentence transformer model for enhanced sentiment classification performance. Our sentence transformer architecture based on RoBERTa-Large, fine-tuned with the CosineSimilarity loss function and combined with an XGBoost classifier, achieved a maximum accuracy of 88.4%, while demonstrating high recall even for minority sentiment classes (77.3% for neutral and 83.9% for positive sentiment) without any data augmentation. Furthermore, to evaluate the robustness of our methodology, we also utilized the classic benchmark dataset “IMDB” and achieved an impressive accuracy of 95.9% using a sentence transformer architecture based on RoBERTa-Large, fine-tuned with the CosineSimilarity loss function and combined with a Support Vector Machine classifier. We have also conducted a comparative analysis of our methodology vis-à-vis the advanced Meta-Llama-3-8B large language model in terms of performance, training time, and inference time.
Our study demonstrates that fine-tuned sentence transformer models can match or even outperform most existing techniques, including transformer-based architectures, for sentiment classification. They offer the added advantage of reduced computational load and maintain generalizability, even when encountering data that deviates from their fine-tuned training set.
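For readers unfamiliar with the CosineSimilarity loss objective referenced above, the following minimal Python sketch shows the quantity it minimizes: the mean squared error between the cosine similarity of each sentence-embedding pair and that pair's gold similarity label. This is not the authors' code; the two-dimensional toy embeddings and labels are illustrative assumptions, standing in for the real sentence-transformer outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs):
    """Mean squared error between the cosine similarity of each
    embedding pair (u, v) and its gold similarity label, mirroring
    the objective of a cosine-similarity fine-tuning loss."""
    return sum((cosine_similarity(u, v) - label) ** 2
               for u, v, label in pairs) / len(pairs)

# Toy pairs: identical embeddings labeled 1.0 (same sentiment),
# orthogonal embeddings labeled 0.0 (different sentiment).
pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),
    ([1.0, 0.0], [0.0, 1.0], 0.0),
]
print(cosine_similarity_loss(pairs))  # 0.0 for this perfectly-fit toy set
```

During fine-tuning, the sentence-transformer weights are updated so that this loss decreases, pulling same-class sentence embeddings together and pushing different-class embeddings apart before the downstream classifier (e.g. XGBoost or SVM) is trained on them.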
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We deeply value the insights provided by the reviewers of our paper; their expertise has greatly contributed to the improvement and rigor of our work. The major changes in our manuscript in response to the reviewers' comments are:

1. To evaluate the robustness of our methodology, we have:
   a. Used another dataset, “IMDB”, and adopted a similar approach to develop a fine-tuned sentence transformer model based on RoBERTa-Large as the underlying transformer model. The results are tabulated in Table 3 of the paper.
   b. Evaluated the above model, fine-tuned and trained on the IMDB training dataset, on the Yelp-2 test dataset to demonstrate our model's generalizability under a binary sentiment classification scenario; the results are tabulated in Table 3 of the paper.
   c. In addition to the MPNet transformer model, developed a separate sentence transformer model based on the RoBERTa-Large transformer model for our initial dataset, “Twitter US Airline Sentiment”; the results are tabulated in Table 3 of the paper.
   d. Used another dataset, “Twitter Apple Sentiment”, on which we evaluated the model fine-tuned and trained on our initial dataset “Twitter US Airline Sentiment” to demonstrate the model's generalizability in a tri-class sentiment classification scenario; the results are tabulated in Table 3 of the paper.
   e. Performed a comparative analysis of our methodology and Meta-Llama-3-8B in terms of performance, training time, and inference time for the Twitter US Airline Sentiment and IMDB datasets under both zero-shot and fine-tuned settings; the results are tabulated in Table 4 of the paper.
2. Table 1 in our paper now reflects the performance of only advanced models, for a fair comparison with our approach. We have also removed the reference to “Word2vec + RNN (two classes only) (Dang et al., 2020)” from Table 1 to prevent confusion: that study excluded “neutral” samples from the Twitter US Airline Sentiment dataset, making it a binary classification task, so its results are not comparable to our tri-class sentiment model.
3. We have added a paragraph under the Broader Impact Statement section highlighting the potential for bias introduction through our methodology.
Assigned Action Editor: ~Jeffrey_Pennington1
Submission Number: 3385