Iterated collocation extraction through mutual information in Mandarin Legal Documents

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone
Abstract: Collocation identification is important for many natural language processing tasks. In Mandarin, identifying collocations is especially challenging because of the orthography and the highly compositional nature of the language. While most popular segmentation tools can identify common collocations, their performance degrades substantially on domain-specific texts. In this paper, we present a novel collocation extraction technique for domain-specific texts that uses iterated segmentation based on the popular mutual information measure and a variant of it, averaged mutual information. We find that mutual-information-based collocation extraction did not benefit from iterated segmentation, whereas extraction based on averaged mutual information performed better after several rounds of iterated segmentation. We also identify differences between the two measures: segmentation based on mutual information generally reached higher precision, but the non-collocations it extracted generally had larger edit distances than those extracted with averaged mutual information.
Paper Type: short
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: Taiwan Mandarin, Taiwan Southern Min
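
The iterated segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, thresholds, or stopping rule, so the sketch assumes standard pointwise mutual information over adjacent tokens, starts from single characters, and merges every adjacent pair whose score passes a cutoff, then rescores the merged text in the next round. The function names and the `threshold`, `min_count`, and `rounds` parameters are all hypothetical. The averaged mutual information variant is not defined in the abstract and is left out.

```python
import math
from collections import Counter


def pmi_scores(corpus, min_count=2):
    """Pointwise mutual information (base 2) for adjacent token pairs.

    corpus is a list of token lists. Pairs seen fewer than min_count
    times are skipped, since PMI is unreliable for rare events.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def merge_pairs(tokens, pairs):
    """Greedily merge adjacent tokens, left to right, when the pair is in pairs."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def iterated_extraction(sentences, threshold=0.0, min_count=2, rounds=3):
    """Assumed iteration scheme: score, merge, and rescore for a few rounds."""
    corpus = [list(s) for s in sentences]  # start from single characters
    found = set()
    for _ in range(rounds):
        scores = pmi_scores(corpus, min_count)
        pairs = {pair for pair, s in scores.items() if s > threshold}
        if not pairs:
            break  # nothing left to merge
        found.update(x + y for x, y in pairs)
        corpus = [merge_pairs(tokens, pairs) for tokens in corpus]
    return corpus, found
```

Because merged units become single tokens in the next round, longer candidates can only form from units that were accepted earlier. This is one plausible reading of "iterated segmentation", not a confirmed description of the paper's procedure.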
