Small Language Model Learning with Inconsistent Input

ACL ARR 2025 May Submission 3819 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Modern language models, such as GPT-3, BERT, and LLaMA, are notoriously data-hungry, requiring anywhere from millions to over a trillion tokens of training data. Yet transformer-based models have demonstrated a remarkable ability to learn natural languages: after sufficient training, they can consistently distinguish grammatical from ungrammatical sentences. Children as young as 14 months, by contrast, can already learn abstract grammar rules from very few examples, even in the presence of non-rule-following exceptions. Yang's (2016) Tolerance Principle specifies an exact threshold for how many exceptions a given dataset may contain for a rule to remain learnable by humans. To test for evidence of Tolerance-Principle-like effects, we explore the minimal amount and quality of training data that language models need in order to generalize a rule. We implement BabyBERTa (Huebner et al., 2021), a transformer-based language model optimized for training on much smaller corpora than most LLMs, and train it on very small artificial grammar sets. For the simplest kind of rule, BabyBERTa can learn from datasets of under 1,000 tokens. The effect of the type and token frequency of exemplars vs. exceptions on learning follows a gradient; we see no effect that can be related to the Tolerance Principle.
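
For reference, the exact threshold the abstract alludes to is standardly stated as follows (restated here from Yang 2016, not quoted from the submission): a rule applying to N items remains productive only if the number of exceptions e satisfies

    e \le \theta_N, \qquad \theta_N = \frac{N}{\ln N}

where N is the number of items the rule could apply to and e is the number of items that do not follow it.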
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: linguistic theories, cognitive modeling, computational psycholinguistics
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 3819