Abstract: Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization that prepare text for further processing and analysis. Although these techniques are context-dependent, traditional methods, such as the Porter stemmer, usually ignore contextual information. Moreover, low-resource languages frequently lack the comprehensive linguistic resources needed to define traditional preprocessing techniques. In this paper, we investigate using Large Language Models (LLMs) to perform various preprocessing tasks, exploiting their ability to understand context without requiring extensive language-specific annotated resources. Through a comprehensive evaluation, we compare LLM-based preprocessing (specifically stopword removal, lemmatization, and stemming) to traditional algorithms across multiple text classification tasks in five European languages. Our analysis shows that LLMs can replicate traditional lemmatization and stemming methods with up to 83% accuracy. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 8.6% in $F_1$ score compared to models trained on texts preprocessed with traditional techniques. Our code and results are publicly available at https://anonymous.4open.science/r/llm_pipeline-7B0D/.
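As a concrete illustration of what LLM-based preprocessing can look like in practice, the sketch below prompts a model to lemmatize a sentence. The prompt wording, the function name llm_lemmatize, and the model choice are assumptions for illustration only; the authors' actual prompts and models are in the linked repository. The OpenAI chat completions API is used here as one possible backend.

```python
# Hedged sketch: LLM-based lemmatization via prompting.
# Assumptions (not from the paper): prompt wording, model name,
# and the helper name llm_lemmatize are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def llm_lemmatize(text: str, language: str = "English") -> str:
    """Ask the model to return one lemma per input token, space-separated."""
    prompt = (
        f"Lemmatize the following {language} text. "
        "Return only the lemmatized text, one lemma per input token, "
        f"separated by spaces.\n\nText: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for reproducibility
    )
    return response.choices[0].message.content.strip()


# Example: llm_lemmatize("the children were running")
# would be expected to return something like "the child be run".
```

The same pattern extends to stemming or stopword removal by changing the instruction in the prompt, which is what makes prompting attractive for languages lacking dedicated preprocessing tools.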
Paper Type: Long
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Lemmatization, Stemming, Stopwords, Prompting, Large Language Models
Contribution Types: Model analysis & interpretability
Languages Studied: English, French, Italian, Spanish, German
Submission Number: 4387