Constructing a Large Scale Text Corpus Based on the Grid and Trustworthiness

Peifeng Li, Qiaoming Zhu, Peide Qian, Geoffrey C. Fox

Published: 2007, Last Modified: 17 Jul 2025. Venue: TSD 2007. License: CC BY-SA 4.0

Abstract: Constructing a large-scale corpus is a hard task. We propose a novel, trustworthiness-based approach that automatically builds a large-scale text corpus at low cost and within a short building period. It mainly addresses two problems: how to automatically build a large-scale text corpus from the Web, and how to correct mistakes in that corpus. Because the Grid provides infrastructure for processing large-scale data, the approach first uses the Grid to collect and process language materials from the Web. It then picks out untrustworthy language materials in the corpus according to their trustworthiness and has users check them manually. Once checking finishes, the approach computes the trustworthiness of each checked result and selects the results with the highest trustworthiness as the correct ones.
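
The two correction steps described in the abstract (flagging untrustworthy materials, then choosing among the users' checked results) can be illustrated with a minimal sketch. The abstract does not say how trustworthiness is computed, so everything here is an assumption made for illustration: the threshold rule, the per-user trust weights, the weighted-vote scoring, and the function names `select_untrustworthy` and `pick_correct_result` are hypothetical and are not the authors' method.

```python
from collections import defaultdict


def select_untrustworthy(item_scores, threshold):
    """Return ids of corpus items whose trustworthiness is below the threshold.

    ASSUMPTION: items carry a precomputed score in [0, 1]; the paper's
    actual scoring is not given in the abstract.
    """
    return [item_id for item_id, score in item_scores.items() if score < threshold]


def pick_correct_result(checks, user_trust):
    """Choose the checked result with the highest trustworthiness.

    checks: list of (user, proposed_result) pairs for one item.
    user_trust: mapping from user to a trust weight.
    ASSUMPTION: a result's trustworthiness is the trust-weighted share of
    users who proposed it (a plain weighted vote).
    """
    totals = defaultdict(float)
    for user, result in checks:
        totals[result] += user_trust.get(user, 0.0)

    total_weight = sum(totals.values())
    if total_weight == 0:
        return None, 0.0  # no usable votes for this item

    best = max(totals, key=totals.get)
    return best, totals[best] / total_weight


if __name__ == "__main__":
    # Hypothetical data: three corpus items and three users checking one item.
    item_scores = {"s1": 0.95, "s2": 0.40, "s3": 0.72}
    print(select_untrustworthy(item_scores, threshold=0.8))

    checks = [("u1", "noun"), ("u2", "verb"), ("u3", "noun")]
    user_trust = {"u1": 0.9, "u2": 0.6, "u3": 0.5}
    print(pick_correct_result(checks, user_trust))
```

In this sketch, items scoring below the threshold are sent for manual checking, and each checked item is resolved by a trust-weighted vote among its checkers. Any other aggregation rule could be substituted without changing the overall flow.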