Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks

Mona T. Diab, Kadri Hacioglu, Daniel Jurafsky

HLT-NAACL (Short Papers) 2004 (modified: 16 Jul 2019)

Abstract: To date, there are no fully automated systems addressing the community's need for fundamental language processing tools for Arabic text. In this paper, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segmenting off clitics), part-of-speech (POS) tag and annotate base phrases (BPs) in Arabic text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the SVM-TOK tokenizer achieves an $F_{\beta=1}$ score of 99.12, the SVM-POS tagger achieves an accuracy of 95.49%, and the SVM-BP chunker yields an $F_{\beta=1}$ score of 92.08.
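
For readers unfamiliar with the metric, the $F_{\beta=1}$ score quoted above is conventionally the harmonic mean of precision and recall. The sketch below shows that standard definition only; the abstract does not report the underlying precision and recall, so the function name `f_beta` and the example values are hypothetical and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-measure: weighted harmonic mean of precision and recall.

    With beta = 1, precision and recall are weighted equally.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero; define the score as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical values for illustration only; NOT results from the paper.
example_precision = 0.90
example_recall = 0.80
example_f1 = f_beta(example_precision, example_recall)
print(f"F(beta=1) = {example_f1:.4f}")
```

The abstract reports its scores on a 0 to 100 scale, so a value computed this way from fractional precision and recall would be multiplied by 100 for comparison.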
