Building a Named Entity Annotated Bilingual English-Vietnamese Corpus

Tuan-An Dao, Hung-Thinh Truong, Long H. B. Nguyen, Dien Dinh

Published: 2018, Last Modified: 19 Jun 2023, Venue: KSE 2018, Readers: Everyone

Abstract: Bilingual corpora play an essential role in cross-lingual Natural Language Processing tasks such as Machine Translation, Information Extraction, and Information Retrieval. Nevertheless, building such corpora manually is time-consuming and expensive, especially for resource-limited languages like Vietnamese. In this paper, we propose a symmetric Named Entity Alignment method to automatically construct a named entity annotated bilingual English-Vietnamese corpus. Our system uses expansion heuristics for candidate generation and bilingual features for candidate selection. The proposed system outperforms the state-of-the-art method in English-Vietnamese Named Entity Alignment, increasing the F1-score from 82.68% to 86.57%. Moreover, Vietnamese Named Entity Recognition performance improves by 23.41% in F1-score compared to a Conditional Random Field model (StanfordNER). Our method can be generalized to develop corpora for other resource-limited languages.
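The results above are reported as F1-scores. As a reminder of how that metric combines precision and recall, here is a minimal sketch that scores predicted English-Vietnamese entity pairs against a gold set. It assumes exact-match scoring over pairs, which the abstract does not specify; the function name `alignment_f1` and the toy entity pairs are hypothetical and are not taken from the paper's data or code.

```python
def alignment_f1(predicted, gold):
    """Return (precision, recall, F1) over sets of (english, vietnamese) pairs.

    Assumption: a predicted pair counts as correct only if it matches a
    gold pair exactly. The paper may use a different matching criterion.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = {
    ("Hanoi", "Hà Nội"),
    ("Vietnam", "Việt Nam"),
    ("United Nations", "Liên Hợp Quốc"),
}
predicted = {
    ("Hanoi", "Hà Nội"),
    ("Vietnam", "Việt Nam"),
    ("United", "Liên Hợp Quốc"),  # boundary error on the English side
}

precision, recall, f1 = alignment_f1(predicted, gold)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under exact matching, a single boundary error such as "United" versus "United Nations" costs both a false positive and a false negative, which is one reason candidate generation by span expansion can matter for the final score.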
