Text Complexity Alone Does Not Matter in Pretraining Language Models

ACL ARR 2025 May Submission7991 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Improving the quality and size of the training corpus is known to enhance the downstream performance of language models on general language understanding tasks. However, the impact of text complexity on downstream performance has received less attention. Text complexity refers to how hard a text is to read, and is typically estimated from surface cues such as word choice, sentence length, and vocabulary diversity. Our approach reduces this surface-level complexity (shorter sentences, simpler words, lower vocabulary diversity) while keeping the core text content constant. We ask two core questions: (1) Does text complexity matter in pretraining? and (2) How does the text complexity of pretraining corpora affect the performance of language models on general language understanding tasks? To answer these questions, we simplify human-written texts using a large language model, with the goal of retaining the core content, and pretrain GPT2-small models on both the original and simplified versions. We present empirical evidence that reducing surface-level complexity does not significantly affect performance on general language understanding tasks, indicating that other corpus characteristics play a more important role.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, fine-tuning
Contribution Types: Data analysis
Languages Studied: English
Submission Number: 7991