Abstract: Modern language models, such as GPT-3, BERT, and LLaMA, are notoriously data-hungry, requiring millions to over a trillion tokens of training data. Yet transformer-based models have demonstrated a remarkable ability to learn natural languages: after sufficient training, they can consistently distinguish grammatical from ungrammatical sentences. Children as young as 14 months already have the capacity to learn grammar rules from very few examples, even in the presence of non-rule-following exceptions. Yang’s (2016) Tolerance Principle (TP) predicts an all-or-none threshold (N/ln N) on how many exceptions in a training set are tolerable for a rule to be learnable by humans; beyond this threshold, learners must memorize the exceptions and rule exemplars instead of forming a generalizable rule. To test for TP-like effects, we use BabyBERTa (Huebner et al. 2021), a transformer-based language model optimized for unsupervised training on smaller corpora than most LLMs. We train it on a very simple rule with very small training sets; BabyBERTa can learn the rule from datasets of under 1,000 tokens. We then test how varying the type and token frequencies of rule exemplars versus exceptions affects learning. Learning follows a continuous gradient, with no evidence of a TP threshold effect.
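For concreteness, a minimal sketch of the TP threshold mentioned in the abstract; the function names and the example numbers are illustrative, not taken from the paper:

```python
import math

def tolerance_threshold(n_types: int) -> float:
    """Yang's (2016) Tolerance Principle threshold: a rule over N item
    types is predicted to be learnable only if the number of exceptions
    does not exceed N / ln N."""
    return n_types / math.log(n_types)

def rule_is_productive(n_types: int, n_exceptions: int) -> bool:
    """All-or-none prediction: productive generalization at or below the
    threshold, item-by-item memorization above it."""
    return n_exceptions <= tolerance_threshold(n_types)

# Example: with 100 item types, roughly 21.7 exceptions are tolerated.
print(tolerance_threshold(100))       # ~21.71
print(rule_is_productive(100, 20))    # True  -> rule predicted learnable
print(rule_is_productive(100, 30))    # False -> memorization predicted
```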
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: linguistic theories, cognitive modeling, computational psycholinguistics
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 591