Optimized Term Extraction Method Based on Computing Merged Partial C-Values

2019 (modified: 06 Nov 2022), ICTERI (Revised Selected Papers) 2019, Readers: Everyone
Abstract: Assessing the completeness of a document collection, in terms of its terminological coverage of a domain of interest, is a complicated task that requires substantial computational resources and human effort. Automated term extraction (ATE) is an important step within this task in our OntoElect approach. It outputs the bags of terms extracted from incrementally enlarged partial document collections for measuring terminological saturation. Saturation is measured iteratively, using our $thd$ measure of terminological distance between two bags of terms. The bags of retained significant terms $T_{i}$ and $T_{i+1}$, extracted at the $i$-th and $(i+1)$-st iterations, are compared by computing $thd(T_{i}, T_{i+1})$ until $thd$ is detected to have fallen below the individual term significance threshold. The flaw of our conventional approach is that the sequence of input datasets is built by adding an increment of several documents to the previous dataset. Hence, the major part of the documents undergoes term extraction repeatedly, which is counter-productive. In this paper, we propose and prove the validity of an optimized pipeline based on a modified C-value method. It processes disjoint partitions of a collection rather than incrementally enlarged datasets. It computes partial C-values and then merges them into the resulting bags of terms. We prove that the results of extraction are statistically the same for the conventional and optimized pipelines. We support this formal result with evaluation experiments that demonstrate independence from the document collection and the domain. By comparing run times, we demonstrate the efficiency of the optimized pipeline. We also show experimentally that the optimized pipeline scales up effectively to process document collections of industrial size.
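The intuition behind processing disjoint partitions can be illustrated with a minimal Python sketch. It is not the paper's merging procedure, which the abstract does not spell out. It assumes the classic C-value formula of Frantzi et al., a fixed list of candidate terms, and pre-tokenised documents, and it merges partial frequency counts (rather than partial C-values) to show that statistics gathered per partition add up to those of the whole collection. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def count_candidates(docs, candidates):
    """Count every occurrence of each candidate term (a tuple of tokens)
    in a list of tokenised documents, nested occurrences included."""
    counts = Counter()
    for tokens in docs:
        for cand in candidates:
            n = len(cand)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == cand:
                    counts[cand] += 1
    return counts


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of a longer term."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_values(freq):
    """Classic C-value (assumed here, not the paper's modified variant):
    log2|a| * f(a) for non-nested terms, otherwise
    log2|a| * (f(a) - mean frequency of the longer terms containing a)."""
    result = {}
    for a, f_a in freq.items():
        longer = [b for b in freq if is_nested(a, b)]
        if longer:
            f_a = f_a - sum(freq[b] for b in longer) / len(longer)
        result[a] = math.log2(len(a)) * f_a
    return result


# Toy collection (hypothetical data), split into two disjoint partitions.
docs = [
    ["automated", "term", "extraction", "method"],
    ["term", "extraction", "pipeline"],
    ["automated", "term", "extraction"],
]
candidates = [("term", "extraction"), ("automated", "term", "extraction")]

whole = count_candidates(docs, candidates)
merged = (count_candidates(docs[:2], candidates)
          + count_candidates(docs[2:], candidates))

# Each document is scanned once, yet the merged statistics match the whole.
scores_whole = c_values(whole)
scores_merged = c_values(merged)
```

In this simplified setting the merged counts, and therefore the C-values, coincide exactly with those computed over the full collection. The abstract's claim is weaker (statistical sameness), presumably because the real pipeline merges partial C-values and candidate sets may differ between partitions; that part is not modelled here.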