A Novel Efficient and Effective Preprocessing Strategy for Text Classification


16 Nov 2021 (modified: 05 May 2023), ACL ARR 2021 November Blind Submission, Readers: Everyone
Abstract: Text classification is an essential task in natural language processing. Preprocessing, which determines how text features are represented, is one of the key steps in a text classification architecture. This paper proposes a novel, efficient, and effective preprocessing strategy comprising three methods for text classification, with the OMP algorithm used to perform the classification. The main idea of the strategy is to combine regular filtering and/or stopword removal with tokenization and lowercase conversion, which effectively reduces the feature dimension and, to some extent, improves the quality of the text feature matrix. Simulation tests on the 20Newsgroups dataset show that, compared with the existing state-of-the-art method, our best new method reduces the number of features by 19.85%, 34.35%, 26.25%, and 38.67% and increases the speed of text classification by 17.38%, 25.64%, 23.76%, and 33.38%, with similar classification accuracy, on the religion, computer, science, and sport data, respectively.
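The combination of steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that "regular filtering" means keeping only alphabetic runs matched by a regular expression, and the stopword list, the function names `preprocess` and `vocabulary_size`, and the two sample sentences are all hypothetical. The classification stage with OMP is not covered.

```python
import re

# Assumption: a tiny illustrative stopword list; the paper's actual list is not given.
STOPWORDS = frozenset({"a", "an", "the", "is", "of", "and", "to", "in", "on", "it"})

# Assumption: "regular filtering" is read here as a regex that keeps alphabetic runs only.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+")


def preprocess(text, use_filter=True, remove_stopwords=True):
    """Tokenize, lowercase, and optionally filter and drop stopwords."""
    # With filtering, punctuation and digits are discarded during tokenization;
    # without it, tokens are simply split on whitespace.
    tokens = TOKEN_PATTERN.findall(text) if use_filter else text.split()
    tokens = [token.lower() for token in tokens]
    if remove_stopwords:
        tokens = [token for token in tokens if token not in STOPWORDS]
    return tokens


def vocabulary_size(docs, **options):
    """Count distinct tokens across docs, i.e. the feature dimension."""
    vocab = set()
    for doc in docs:
        vocab.update(preprocess(doc, **options))
    return len(vocab)


# Hypothetical sample documents, used only to show the effect on vocabulary size.
docs = [
    "The GPU is fast, and the CPU is slow!",
    "A gpu on the board: it is FAST.",
]
baseline = vocabulary_size(docs, use_filter=False, remove_stopwords=False)
reduced = vocabulary_size(docs)
print(f"features without filtering: {baseline}, with filtering: {reduced}")
```

On these toy sentences the filtered vocabulary is smaller because punctuation-bearing variants such as `fast,` and `fast.` collapse into one token and stopwords are dropped. The reduction figures reported in the abstract come from the 20Newsgroups experiments, not from this sketch.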
