Abstract: Amharic is a Semitic language widely spoken in Ethiopia. According to expert recommendations, some documents written in this language contain complex texts that need simplification; complexity here refers to how difficult a text is for its target readers to understand. Beyond human readers, such complex text also challenges NLP applications such as machine translation. To address this issue, we developed three sequential models: a complexity classification model, a complex term detection model, and a simple text generation model. For the first task, we fine-tuned the pre-trained transformer-based models BERT and XLNet on 33.9k Amharic sentences, and the detection model was built from 1,002 complex terms. Finally, 91k Amharic sentences were used to build the simple text generation models based on Word2Vec, FastText, and RoBERTa. Experimental results show that the BERT and XLNet classification models achieve accuracies of 86.1% and 70%, respectively. For detecting specific complex terms and generating simple equivalent text, the Word2Vec model yields the best prediction and ranking results: it generates the most similar simple terms with a cosine similarity of 0.91, compared with 0.85 for FastText and 0.57 for RoBERTa. We recommend addressing the syntactic complexity of Amharic text as future research.
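To make the generation step concrete, the following is a minimal sketch (not the authors' released code) of how candidate simple terms can be ranked by cosine similarity in a Word2Vec space trained on Amharic sentences, as the abstract describes. The placeholder corpus and the example complex term are hypothetical; the paper reports training on roughly 91k sentences.

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of tokenized Amharic sentences.
# In the paper's setting this would be ~91k sentences, not this toy list.
sentences = [["ምሳሌ", "ዓረፍተ", "ነገር"]]

# Train a Word2Vec model; hyperparameters here are illustrative defaults.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# For a detected complex term, retrieve the nearest neighbours by cosine
# similarity and take the top-ranked candidates as simple substitutes.
complex_term = "ምሳሌ"  # hypothetical complex term flagged by the detection model
for candidate, score in model.wv.most_similar(complex_term, topn=5):
    print(candidate, round(score, 2))
```

Under this scheme, the reported 0.91 figure corresponds to the cosine similarity between a complex term and its top-ranked simple candidate in the trained embedding space.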
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Machine Learning for Low-resourced Languages, LLM, Data Annotation and Tool development
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: Amharic
Submission Number: 224