Mining Infrequent High-Quality Phrases from Domain-Specific Corpora

Published: 2020, Last Modified: 22 May 2026, Venue: CIKM 2020, License: CC BY-SA 4.0
Abstract: Phrase mining is a fundamental task in text analysis, with downstream applications such as named entity recognition, topic modeling, and relation extraction. In this paper, we focus on mining high-quality phrases from domain-specific corpora, with special attention to infrequent ones. Previous methods may miss infrequent high-quality phrases at the candidate selection stage. They also rely on explicit features to mine phrases and rarely consider implicit features. In addition, completeness is rarely considered explicitly when evaluating whether a phrase is high quality. We propose a novel approach that uses a sequence labeling model to capture infrequent phrases, and that employs implicit semantic features and contextual POS tag statistics to measure meaningfulness and completeness, respectively. Experiments on four real-world corpora demonstrate that our method achieves significant improvements over previous state-of-the-art methods across different domains and languages.
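
To illustrate why a sequence labeling formulation can surface infrequent phrases, the following is a minimal sketch of turning per-token labels into candidate phrases. It is not the paper's model: the abstract does not specify a tag scheme, so the B/I/O tags, the `decode_bio` helper, and the hand-written example tags are all assumptions made for illustration. The point is only that a span is extracted whenever it is labeled, with no corpus frequency threshold involved.

```python
from typing import List


def decode_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect phrase spans from per-token B/I/O tags.

    Hypothetical tag scheme (an assumption, not taken from the paper):
    "B" starts a phrase, "I" continues it, "O" is outside any phrase.
    """
    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new phrase starts; flush any phrase still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the open phrase.
            current.append(token)
        else:
            # "O", or a stray "I" with no open phrase: close and reset.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-written tags standing in for the output of a trained labeler.
tokens = "we study latent dirichlet allocation for topic modeling".split()
tags = ["O", "O", "B", "I", "I", "O", "B", "I"]
print(decode_bio(tokens, tags))
```

Because extraction depends only on the predicted labels for a sentence, a phrase that appears once in the corpus can still be returned as a candidate, whereas a frequency-filtered candidate list would drop it before any quality scoring.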