Persian Sentiment Analysis Without Training Data Using Cross-Lingual Word Embeddings

Mohammad Aliramezani, Ehsan Doostmohammadi, Mohammad Hadi Bokaei, Hossein Sameti

Published: 01 Jan 2020, Last Modified: 20 May 2025, IST 2020, CC BY-SA 4.0

Abstract: This paper performs low-cost Persian sentiment analysis without any Persian training data. A cross-lingual method is proposed to overcome the shortage of labeled Persian sentiment analysis datasets by using English as a high-resource language. A cross-lingual model between English and Persian is trained to generate aligned word embeddings, which serve as the feature vectors of the sentiment model. The monolingual word embeddings used in the cross-lingual approach are English FastText and Persian GloVe. VecMap is used in supervised mode as the cross-lingual tool to align the English and Persian word embeddings, with a 5,000-word English-Persian bilingual dictionary as the supervision. Bilingual lexicon induction evaluation shows that English and Persian are properly aligned in the joint space. The proposed sentiment analysis model is trained on an English dataset and then tested on Persian using the aligned English-Persian word embeddings. The training data is the Amazon Fine Food Reviews dataset, and the Persian Snapp Food dataset is used as the test data. Although the model uses no Persian data during training, it performs well on the sentiment analysis task, reaching an F1-score of 78.16% on the Persian test data.
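
The pipeline described in the abstract can be illustrated with a minimal sketch, under several loudly stated assumptions. The 2-D vectors below are made up and stand in for embeddings that have already been aligned into a joint space (the alignment step itself, done with VecMap in the paper, is not reproduced here). The nearest-centroid classifier is a simple stand-in chosen for brevity, since the abstract does not specify the sentiment model's architecture. All function names are hypothetical. The sketch shows two ideas: bilingual lexicon induction as nearest-neighbor retrieval across languages, and zero-shot transfer, where a classifier fit only on English sentences is applied to Persian sentences through the shared embedding space.

```python
import math

# Toy 2-D vectors standing in for embeddings ALREADY aligned into one space.
# All values are invented for illustration; none come from the paper.
EN = {"good": (0.9, 0.1), "great": (0.8, 0.3),
      "bad": (-0.9, 0.2), "awful": (-0.8, -0.1)}
FA = {"خوب": (0.85, 0.15), "عالی": (0.75, 0.35),
      "بد": (-0.85, 0.25), "افتضاح": (-0.75, -0.15)}

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def translate(word, src, tgt):
    """Lexicon induction: the target word nearest to `word` in the joint space."""
    return max(tgt, key=lambda t: cosine(src[word], tgt[t]))

def precision_at_1(gold, src, tgt):
    """Fraction of gold pairs whose nearest neighbor is the gold translation."""
    hits = sum(translate(s, src, tgt) == t for s, t in gold.items())
    return hits / len(gold)

def sentence_vector(tokens, emb):
    """Average the vectors of in-vocabulary tokens (zero vector if none)."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def train_centroids(examples, emb):
    """Fit one mean sentence vector per label from (tokens, label) pairs."""
    grouped = {}
    for tokens, label in examples:
        grouped.setdefault(label, []).append(sentence_vector(tokens, emb))
    return {label: tuple(sum(c) / len(vs) for c in zip(*vs))
            for label, vs in grouped.items()}

def predict(tokens, emb, centroids):
    """Assign the label whose centroid is most similar to the sentence vector."""
    vec = sentence_vector(tokens, emb)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Check alignment quality on a tiny hypothetical gold dictionary.
gold = {"good": "خوب", "great": "عالی", "bad": "بد", "awful": "افتضاح"}
p_at_1 = precision_at_1(gold, EN, FA)

# Train on English only, then predict on Persian tokens (zero-shot transfer).
english_train = [(["good", "great"], "pos"), (["great", "food"], "pos"),
                 (["bad", "awful"], "neg"), (["awful", "service"], "neg")]
centroids = train_centroids(english_train, EN)
persian_preds = [predict(["غذا", "عالی"], FA, centroids),
                 predict(["بد", "افتضاح"], FA, centroids)]
print(p_at_1, persian_preds)
```

Because the classifier only ever sees vectors, not words, swapping the English lookup table for the Persian one at test time is all that zero-shot transfer requires; its quality then depends on how well the two embedding spaces are aligned, which is what the lexicon induction check estimates.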