Iterated collocation extraction through mutual information in Mandarin Legal Documents

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone

Abstract: Collocation identification is an important component of many natural language processing tasks. In Mandarin, identifying collocations is especially challenging because the orthography does not mark word boundaries and the language is highly compositional. While most popular segmentation tools can identify common collocations, their performance degrades substantially on domain-specific texts. In this paper, we present a novel collocation extraction technique for domain-specific texts, based on iterated segmentation using the popular mutual information measure and a variant, averaged mutual information. We find that collocation extraction based on mutual information did not benefit from iterated segmentation, whereas extraction based on averaged mutual information performed better after several iterations. We also identify differences between the two measures: segmentation based on mutual information generally reached higher precision, but the non-collocations it extracted generally had larger edit distances than those extracted with averaged mutual information.
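
The general idea of iterated, mutual-information-based extraction can be illustrated with a minimal sketch. This is not the authors' implementation: it uses standard pointwise mutual information over adjacent tokens, and the function names, the threshold, the minimum count, and the iteration limit are all assumptions chosen for illustration. The averaged mutual information variant is not defined in the abstract, so it is not implemented here.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information (log base 2) for each adjacent token pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_pairs(tokens, pairs):
    """Greedy left-to-right merge of adjacent tokens whose pair is in `pairs`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def iterated_extraction(tokens, threshold=1.0, min_count=2, max_iter=3):
    """Repeatedly merge high-PMI adjacent pairs and re-score the new segmentation.

    `threshold`, `min_count`, and `max_iter` are illustrative assumptions,
    not values taken from the paper.
    """
    candidates = set()
    for _ in range(max_iter):
        bigrams = Counter(zip(tokens, tokens[1:]))
        scores = pmi_scores(tokens)
        selected = {
            pair
            for pair, score in scores.items()
            if score >= threshold and bigrams[pair] >= min_count
        }
        if not selected:
            break
        candidates.update(a + b for a, b in selected)
        tokens = merge_pairs(tokens, selected)
    return tokens, candidates
```

Starting from single characters (or from the output of an existing segmenter), each pass merges the selected pairs into longer units, so later passes can score pairs of already merged units and recover collocations longer than two tokens.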

Paper Type: short

Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics

Contribution Types: Model analysis & interpretability, Theory

Languages Studied: Taiwan Mandarin, Taiwan Southern Min
