Abstract: Collocation identification is an important component of many natural language processing tasks. In Mandarin, identifying collocations is especially challenging owing to the orthography and the highly compositional nature of the language. While most popular segmentation tools can identify common collocations, their performance degrades considerably on domain-specific texts. In this paper, we present a novel collocation extraction technique for domain-specific texts that applies iterated segmentation based on the popular mutual information measure and its variant, averaged mutual information. We found that extraction based on mutual information did not benefit from iterated segmentation, whereas extraction based on averaged mutual information performed better after several rounds of iterated segmentation. We also identified differences between the two measures: segmentation based on mutual information generally reached higher precision, but the non-collocations it extracted generally had larger edit distances than those extracted with averaged mutual information.
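To make the idea of iterated segmentation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the text starts as a list of single characters, scores adjacent pairs with standard pointwise mutual information, and merges pairs above a threshold in each round. The function names, the `threshold`, `min_count`, and `rounds` parameters, and the greedy left-to-right merge are all illustrative assumptions; the averaged mutual information variant is not shown because the abstract does not define it.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent token pairs with pointwise mutual information (log base 2).

    Pairs seen fewer than `min_count` times are skipped (an assumed filter,
    since PMI is unreliable for rare pairs).
    """
    if len(tokens) < 2:
        return {}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_pairs(tokens, pairs):
    """Greedily merge adjacent tokens, left to right, when the pair is in `pairs`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def iterated_segmentation(tokens, threshold=1.0, min_count=2, rounds=2):
    """Repeat score-and-merge for up to `rounds` passes (hypothetical settings)."""
    for _ in range(rounds):
        scores = pmi_scores(tokens, min_count=min_count)
        keep = {pair for pair, score in scores.items() if score >= threshold}
        if not keep:
            break  # nothing left to merge
        tokens = merge_pairs(tokens, keep)
    return tokens


if __name__ == "__main__":
    text = "機器學習很有趣機器學習很重要"
    print(iterated_segmentation(list(text)))
```

In this sketch each round re-estimates the scores over the tokens produced by the previous round, so a multi-character unit can only emerge after its parts have been merged. The number of rounds matters: too many passes can keep merging frequent neighbours past the intended collocation boundary.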
Paper Type: short
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: Taiwan Mandarin, Taiwan Southern Min