Abstract: One of the crucial activities in Natural Language Processing (NLP) is to tokenize text to extract features so that data mining models can be applied. Many widely used tokenization algorithms take the approach of using words as tokens. This approach suffers from the following limitations: (a) using words as features leads to high dimensionality of the data file generated from the text, and (b) these algorithms apply a one-size-fits-all approach, extracting tokens uniformly without considering the prior knowledge available in the domain. Here, a novel method is proposed that extracts features by tokenizing text as chunks. Domain-specific knowledge is used to generate syntactic rules, which are then used to split text documents into Finite State Rule Based Chunks. The chunks are the tokens on which data mining models are then applied to generate insights. The effectiveness of chunk-based tokenization is demonstrated by extracting chunk-based tokens as well as word-based tokens from a document corpus and performing clustering in both cases. A comparison of the clustering of the corpus with chunk tokens vis-à-vis word tokens shows a marked improvement in clustering performance with the former.
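To make the idea of rule-based chunk tokenization concrete, the following is a minimal sketch, not the authors' implementation: it assumes NLTK's RegexpParser as a stand-in for the paper's finite-state rules, and the grammar, sentence, and function name `chunk_tokens` are illustrative assumptions.

```python
# Minimal sketch of chunk-based tokenization (assumed approach, not the paper's code):
# POS-tag a sentence, apply hand-written tag-pattern rules, and emit chunks as tokens.
import nltk

# One-time resource downloads for the tokenizer and POS tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Hypothetical domain rules written as regular expressions over POS tags;
# RegexpParser compiles each rule into a chunker applied in sequence.
GRAMMAR = r"""
  NP: {<DT>?<JJ>*<NN.*>+}     # noun-phrase chunk
  VP: {<MD>?<VB.*>+<RP>?}     # verb-group chunk
"""
chunker = nltk.RegexpParser(GRAMMAR)

def chunk_tokens(text):
    """Return chunk-level tokens instead of word-level tokens."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    tokens = []
    for node in tree:
        if isinstance(node, nltk.Tree):            # a matched chunk
            tokens.append(" ".join(word for word, _ in node.leaves()))
        else:                                      # an unmatched word stays as-is
            tokens.append(node[0])
    return tokens

print(chunk_tokens("The quick brown fox jumps over the lazy dog."))
# Exact chunks depend on the tagger's output, e.g.
# ['The quick brown fox', 'jumps', 'over', 'the lazy dog', '.']
```

The resulting chunk tokens can then be fed to any vectorizer and clustering model in place of word tokens, which is the comparison the abstract describes.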
Paper Type: Long
Research Area: Syntax: Tagging, Chunking and Parsing
Research Area Keywords: Chunking, part-of-speech tagging, grammar and knowledge-based approaches
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5153