Open Source Corpus Analysis Tools for Malay

LREC 2006 (modified: 03 Nov 2022)
Abstract: Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital advancement of any language. In this paper, we present an open-source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K-word sample of Malay text.