Abstract: Topic modeling discovers the latent topic probabilities of a given set of text documents. To generate more meaningful topics that better represent the given documents, we propose a new feature selection technique that can be used in the data preprocessing stage. The method consists of three steps. First, it generates words and word pairs from every document (feature generation). Second, it applies a two-way TF-IDF algorithm to the words and word pairs for semantic filtering (feature filtering). Third, it uses the K-means algorithm to merge word pairs that have similar semantic meaning (feature coalescence). Our proposed technique can improve the accuracy of the generated topics by up to 12.99%.
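As a rough illustration of the first two steps, the sketch below generates word and word-pair features per document and then filters them by score. It rests on several assumptions that are not taken from the abstract: the function names `generate_features` and `tfidf_filter` and the `keep_ratio` parameter are hypothetical, word pairs are formed from all distinct words co-occurring in a document, and plain TF-IDF stands in for the "two-way" variant, which the abstract does not define. The K-means coalescence step is left out because it depends on a vector representation of word pairs that the abstract does not describe.

```python
import math
from collections import Counter
from itertools import combinations


def generate_features(doc):
    """Step 1 (feature generation): words plus unordered word pairs of one document."""
    words = doc.lower().split()
    vocab = sorted(set(words))
    # Assumption: a word pair is any two distinct words that co-occur in the document.
    pairs = [f"{a}_{b}" for a, b in combinations(vocab, 2)]
    return words + pairs


def tfidf_filter(docs_features, keep_ratio=0.5):
    """Step 2 (feature filtering): keep the top-scoring features per document.

    Assumption: plain TF-IDF is used here as a stand-in for the two-way variant.
    """
    n_docs = len(docs_features)
    # Document frequency: in how many documents each feature appears.
    df = Counter()
    for feats in docs_features:
        df.update(set(feats))

    filtered = []
    for feats in docs_features:
        tf = Counter(feats)
        total = len(feats)
        scores = {f: (c / total) * math.log(n_docs / df[f]) for f, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda f: (-scores[f], f))
        keep = max(1, int(len(ranked) * keep_ratio))
        filtered.append(ranked[:keep])
    return filtered


docs = ["topic model finds topic", "topic model learns words"]
features = [generate_features(d) for d in docs]
selected = tfidf_filter(features)
print(selected[0])
```

In this toy corpus, features shared by both documents (such as `topic` and `model_topic`) receive a score of zero and are filtered out first, which is the intended effect of a TF-IDF style filter.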