Keywords: Ethno-nationality, Native Language Identification, Natural Language Processing, Machine Learning, Linear SVM, Less-controlled environments, ICE corpus
Abstract: Ethno-nationality is where nations are defined by a shared heritage, for instance it can be a membership of a common language, nationality, religion or an ethnic ancestry. The main goal of this research is to determine a person’s country-of-origin using English text written in less controlled environments, employing Machine Learning (ML) and Natural Language Processing (NLP) techniques. The current literature mainly focuses on determining the native language of English writers and a minimal number of researches have been conducted in determining the country-of-origin of English writers.
Further, most experiments in the literature are mainly based on the TOEFL, ICLE datasets which were collected in more controlled environments (i.e., standard exam answers). Hence, most of the writers try to follow some guidelines and patterns of writing. Subsequently, the creativity, freedom of writing and the insights of writers could be hidden. Thus, we believe it hides the real nativism of the writers. Further, those corpora are not freely available as it involves a high cost of licenses. Thus, the main data corpus used for this research was the International Corpus of English (ICE corpus). Up to this point, none of the researchers have utilised the ICE corpus for the purpose of determining the writers’ country-of-origin, even though there is a true potential.
For this research, an overall accuracy of 0.7636 for the flat classification (for all ten countries) and accuracy of 0.6224~1.000 for sub-categories were received. In addition, the best ML model obtained for the flat classification strategy is linear SVM with SGD optimizer trained with word (1,1) uni-gram model.
One-sentence Summary: This research focuses on determining a person's country of origin using his/her English text written in less-controlled environments using NLP and Machine Learning techniques.
Supplementary Material: zip
5 Replies
Loading