Building Arabic corpora from Wikisource

Published: 01 Jan 2013, Last Modified: 15 Jun 2023, Venue: AICCSA 2013, Readers: Everyone
Abstract: This paper describes a new tool that helps extract clean text from the Arabic Wikisource dump in order to build corpora. The tool's purpose is illustrated by generating a subcorpus from Wikisource, a step towards building an evaluation corpus for Arabic intrinsic plagiarism detection.