Abstract: Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization that prepare text for further processing and analysis. Although these techniques are context-dependent, traditional methods, such as the Porter stemmer, usually ignore contextual information. Moreover, low-resource languages frequently lack the comprehensive linguistic resources needed to define traditional preprocessing techniques. In this paper, we investigate using Large Language Models (LLMs) to perform various preprocessing tasks, exploiting their ability to understand context without requiring extensive language-specific annotated resources. Through a comprehensive evaluation, we compare LLM-based preprocessing (specifically stopword removal, lemmatization, and stemming) to traditional algorithms across multiple text classification tasks in five European languages. Our analysis shows that LLMs can replicate traditional lemmatization and stemming methods with up to 83% accuracy. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 8.6% in $F_1$ score compared to models trained on texts preprocessed with traditional techniques. Our code and results are publicly available at https://anonymous.4open.science/r/llm_pipeline-7B0D/.
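As a concrete illustration of what LLM-based preprocessing can look like in practice, the sketch below prompts a model to lemmatize a sentence. The prompt wording, the function name llm_lemmatize, and the model choice are assumptions for illustration only; the authors' actual prompts and models are in the linked repository. The OpenAI chat completions API is used here as one possible backend.

```python
# Hedged sketch: LLM-based lemmatization via prompting.
# Assumptions (not from the paper): prompt wording, model name,
# and the helper name llm_lemmatize are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def llm_lemmatize(text: str, language: str = "English") -> str:
    """Ask the model to return one lemma per input token, space-separated."""
    prompt = (
        f"Lemmatize the following {language} text. "
        "Return only the lemmatized text, one lemma per input token, "
        f"separated by spaces.\n\nText: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for reproducibility
    )
    return response.choices[0].message.content.strip()


# Example: llm_lemmatize("the children were running")
# would be expected to return something like "the child be run".
```

The same pattern extends to stemming or stopword removal by changing the instruction in the prompt, which is what makes prompting attractive for languages lacking dedicated preprocessing tools.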
Paper Type: Long
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Lemmatization, Stemming, Stopwords, Prompting, Large Language Models
Contribution Types: Model analysis & interpretability
Languages Studied: English, French, Italian, Spanish, German
Submission Number: 4387