Text Complexity Matters Less Than Information Content When Pretraining Language Models

ACL ARR 2025 February Submission 6287 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Improving the quality and size of the training corpus is known to enhance the downstream performance of language models on general language understanding tasks. However, the impact of text complexity on downstream performance is less studied. Text complexity refers to how easy or hard a text is to read relative to others, taking into account lexical complexity (e.g., vocabulary choice), syntactic complexity (e.g., sentence structure), and semantic complexity (e.g., information content), among other factors. In this work, we focus on reducing lexical and syntactic complexity while controlling for semantic complexity. We ask two core questions: (1) Does text complexity matter in pretraining? and (2) How does the text complexity of our pretraining corpora affect the performance of language models on general language understanding tasks? To answer these questions, we simplify human-written texts using a large language model while aiming to retain their information content, and we pretrain GPT2-small models on both the original and simplified versions. We present empirical evidence that lexical and syntactic complexity do not significantly affect performance on general language understanding tasks, underscoring the importance of information content when pretraining language models.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, fine-tuning
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 6287
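
To make the experimental setup in the abstract concrete, the sketch below shows one way the two-corpus comparison could be run: pretraining a GPT2-small model from scratch on either the original or the LLM-simplified corpus. This is a minimal, hypothetical illustration only; the Hugging Face Transformers/Datasets tooling, the simplification prompt, the corpus file names, and all hyperparameters are assumptions, not details taken from the paper.

```python
# Minimal sketch of the setup described in the abstract, assuming a Hugging Face
# Transformers / Datasets workflow. File names, prompt wording, and hyperparameters
# are hypothetical placeholders.
from datasets import load_dataset
from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical prompt for the LLM-based simplification step: reduce lexical and
# syntactic complexity while keeping the information content intact. The abstract
# does not specify which model or prompt was actually used.
SIMPLIFY_PROMPT = (
    "Rewrite the following text using simpler vocabulary and shorter sentences, "
    "but keep all of the information it contains:\n\n{text}"
)


def pretrain(corpus_file: str, output_dir: str) -> None:
    """Pretrain a GPT2-small model from scratch on one corpus variant
    (original or simplified)."""
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    # GPT2Config() defaults correspond to the 124M-parameter GPT2-small.
    model = GPT2LMHeadModel(GPT2Config())

    # Load a plain-text corpus and tokenize it for causal language modeling.
    dataset = load_dataset("text", data_files=corpus_file)["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
        batched=True,
        remove_columns=["text"],
    )

    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=8,  # placeholder hyperparameters
        num_train_epochs=1,
        learning_rate=6e-4,
        save_strategy="epoch",
    )
    Trainer(
        model=model,
        args=args,
        train_dataset=dataset,
        data_collator=collator,
    ).train()


# One model per corpus variant; downstream evaluation then compares the two.
# pretrain("original_corpus.txt", "gpt2-small-original")
# pretrain("simplified_corpus.txt", "gpt2-small-simplified")
```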