Abstract: One of the crucial activities in Natural Language Processing (NLP) is to tokenize text to extract features so that data mining models can be applied. Many widely used tokenization algorithms take the approach of using words as tokens. This approach suffers from the following limitations: (a) using words as features leads to high dimensionality of the data file generated from the text, and (b) these algorithms apply a one-size-fits-all approach, extracting tokens uniformly without considering the prior knowledge available in the domain. Here, a novel method is proposed that extracts features by tokenizing text as chunks. Domain-specific knowledge is used to generate syntactic rules, which are then used to split text documents into Finite State Rule Based Chunks. The chunks are the tokens on which data mining models are then applied to generate insights. The effectiveness of chunk-based tokenization is demonstrated by extracting chunk-based tokens as well as word-based tokens from a document corpus and performing clustering in both cases. A comparison of the clustering of the corpus with chunk tokens vis-à-vis word tokens shows a marked improvement in clustering performance with the former.
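To make the idea of rule-based chunk tokenization concrete, the following is a minimal sketch, not the authors' implementation: it assumes NLTK's RegexpParser as a stand-in for the paper's finite-state rules, and the grammar, sentence, and function name `chunk_tokens` are illustrative assumptions.

```python
# Minimal sketch of chunk-based tokenization (assumed approach, not the paper's code):
# POS-tag a sentence, apply hand-written tag-pattern rules, and emit chunks as tokens.
import nltk

# One-time resource downloads for the tokenizer and POS tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Hypothetical domain rules written as regular expressions over POS tags;
# RegexpParser compiles each rule into a chunker applied in sequence.
GRAMMAR = r"""
  NP: {<DT>?<JJ>*<NN.*>+}     # noun-phrase chunk
  VP: {<MD>?<VB.*>+<RP>?}     # verb-group chunk
"""
chunker = nltk.RegexpParser(GRAMMAR)

def chunk_tokens(text):
    """Return chunk-level tokens instead of word-level tokens."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    tokens = []
    for node in tree:
        if isinstance(node, nltk.Tree):            # a matched chunk
            tokens.append(" ".join(word for word, _ in node.leaves()))
        else:                                      # an unmatched word stays as-is
            tokens.append(node[0])
    return tokens

print(chunk_tokens("The quick brown fox jumps over the lazy dog."))
# Exact chunks depend on the tagger's output, e.g.
# ['The quick brown fox', 'jumps', 'over', 'the lazy dog', '.']
```

The resulting chunk tokens can then be fed to any vectorizer and clustering model in place of word tokens, which is the comparison the abstract describes.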
Paper Type: Long
Research Area: Syntax: Tagging, Chunking and Parsing
Research Area Keywords: Chunking, part-of-speech tagging, grammar and knowledge-based approaches
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5153