An Efficient Method for Generating, Storing and Matching Features for Text Mining

Shing-Kit Chan, Wai Lam

Published: 2009, Last Modified: 14 Jan 2026, PAKDD 2009, CC BY-SA 4.0

Abstract: Log-linear models have been widely used in text mining tasks because they can incorporate a large number of possibly correlated features. In text mining, these possibly correlated features are generated by conjunctions of features. They are usually used with log-linear models to estimate robust conditional distributions. To avoid manual construction of feature conjunctions, we propose a new algorithmic framework called F-tree for automatically generating and storing conjunctions of features in text mining tasks. This compact graph-based data structure allows fast one-vs-all matching of features in the feature space, which is crucial for many text mining tasks. Based on this hierarchical data structure, we propose a systematic method for removing redundant features to further reduce memory usage and improve performance. We conduct large-scale experiments on three publicly available datasets and show that this automatic method can attain the state-of-the-art performance achieved by manual construction of features.
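
To make the idea of generating, storing, and matching feature conjunctions concrete, here is a minimal sketch in Python. It is not the authors' F-tree, whose construction is not described in the abstract; it uses an ordinary prefix tree (trie) over sorted conjunctions as a stand-in. The class `ConjunctionTrie`, the helper `generate_conjunctions`, the `max_order` parameter, and the feature strings are all hypothetical names chosen for illustration. "Matching" is assumed here to mean finding every stored conjunction whose atomic features are all active in one instance.

```python
from itertools import combinations


class ConjunctionTrie:
    """Prefix tree over sorted feature conjunctions (illustrative, not the F-tree)."""

    def __init__(self):
        self.children = {}  # atomic feature -> child node
        self.count = 0      # how many times the conjunction ending here was inserted

    def insert(self, conjunction):
        # Sorting gives each conjunction one canonical path, so shared
        # prefixes are stored only once.
        node = self
        for feature in sorted(conjunction):
            node = node.children.setdefault(feature, ConjunctionTrie())
        node.count += 1

    def match(self, active):
        """Return all stored conjunctions fully contained in `active`."""
        active = set(active)
        found = []
        stack = [(self, ())]
        while stack:
            node, prefix = stack.pop()
            if node.count and prefix:
                found.append(prefix)
            # Descend only along features that are active in this instance,
            # which prunes every conjunction that cannot match.
            for feature, child in node.children.items():
                if feature in active:
                    stack.append((child, prefix + (feature,)))
        return sorted(found)


def generate_conjunctions(atomic_features, max_order=2):
    """Yield every conjunction of up to `max_order` atomic features."""
    feats = sorted(set(atomic_features))
    for order in range(1, max_order + 1):
        yield from combinations(feats, order)


# Hypothetical atomic features, e.g. from a named-entity tagging task.
trie = ConjunctionTrie()
for conj in generate_conjunctions(["cap=yes", "prev=Mr.", "suffix=son"]):
    trie.insert(conj)

# Match one instance against everything stored in the trie.
matches = trie.match(["cap=yes", "suffix=son", "word=Johnson"])
print(matches)
```

Because traversal follows only active features, the cost of a lookup depends on the part of the tree reachable from the instance rather than on the total number of stored conjunctions. Redundant-feature removal, which the abstract also mentions, is not attempted in this sketch.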

External IDs: dblp:conf/pakdd/ChanL09