Abstract: Amharic is a Semitic language widely spoken in Ethiopia. According to expert recommendations, some documents written in this language contain complex texts that need simplification; complexity here refers to how difficult a text is for its target readers to understand. Beyond human readers, such complex text also challenges NLP applications such as machine translation. To address this issue, we developed three sequential models: a complexity classification model, a complex term detection model, and a simple text generation model. For the first task, we fine-tuned the pre-trained transformer-based models BERT and XLNet on 33.9k Amharic sentences, and the detection model was built from 1,002 complex terms. Finally, 91k Amharic sentences were used to build the simple text generation models based on Word2Vec, FastText, and RoBERTa. Experimental results show that the BERT and XLNet classification models achieve accuracies of 86.1% and 70%, respectively. For detecting specific complex terms and generating simple equivalent text, the Word2Vec model yields the best prediction and ranking results: it generates the most similar simple terms with a cosine similarity of 0.91, compared with 0.85 for FastText and 0.57 for RoBERTa. We recommend addressing the syntactic complexity of Amharic text as future research.
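To make the generation step concrete, the following is a minimal sketch (not the authors' released code) of how candidate simple terms can be ranked by cosine similarity in a Word2Vec space trained on Amharic sentences, as the abstract describes. The placeholder corpus and the example complex term are hypothetical; the paper reports training on roughly 91k sentences.

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of tokenized Amharic sentences.
# In the paper's setting this would be ~91k sentences, not this toy list.
sentences = [["ምሳሌ", "ዓረፍተ", "ነገር"]]

# Train a Word2Vec model; hyperparameters here are illustrative defaults.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# For a detected complex term, retrieve the nearest neighbours by cosine
# similarity and take the top-ranked candidates as simple substitutes.
complex_term = "ምሳሌ"  # hypothetical complex term flagged by the detection model
for candidate, score in model.wv.most_similar(complex_term, topn=5):
    print(candidate, round(score, 2))
```

Under this scheme, the reported 0.91 figure corresponds to the cosine similarity between a complex term and its top-ranked simple candidate in the trained embedding space.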
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Machine Learning for Low-resourced Languages, LLM, Data Annotation and Tool development
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: Amharic
Submission Number: 224