Abstract: Modern language models, such as GPT-3, BERT, and LLaMA, are notoriously data-hungry, requiring millions to over a trillion tokens of training data. Yet transformer-based models have demonstrated a remarkable ability to learn natural languages: after sufficient training, they can consistently distinguish grammatical from ungrammatical sentences. Children as young as 14 months already have the capacity to learn grammar rules from very few examples, even in the presence of non-rule-following exceptions. Yang’s (2016) Tolerance Principle (TP) predicts an all-or-none threshold (N/ln N) on how many exceptions in a training set are tolerable for a rule to be learnable by humans; beyond this threshold, learners must memorize the exceptions and rule exemplars instead of forming a generalizable rule. To test for TP-like effects, we use BabyBERTa (Huebner et al. 2021), a transformer-based language model optimized for unsupervised training on smaller corpora than most LLMs. We train it on a very simple rule with very small training sets; BabyBERTa can learn the rule from datasets of under 1,000 tokens. We then test how varying the type and token frequencies of rule exemplars versus exceptions affects learning. Learning follows a continuous gradient, with no evidence of a TP threshold effect.
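For concreteness, a minimal sketch of the TP threshold mentioned in the abstract; the function names and the example numbers are illustrative, not taken from the paper:

```python
import math

def tolerance_threshold(n_types: int) -> float:
    """Yang's (2016) Tolerance Principle threshold: a rule over N item
    types is predicted to be learnable only if the number of exceptions
    does not exceed N / ln N."""
    return n_types / math.log(n_types)

def rule_is_productive(n_types: int, n_exceptions: int) -> bool:
    """All-or-none prediction: productive generalization at or below the
    threshold, item-by-item memorization above it."""
    return n_exceptions <= tolerance_threshold(n_types)

# Example: with 100 item types, roughly 21.7 exceptions are tolerated.
print(tolerance_threshold(100))       # ~21.71
print(rule_is_productive(100, 20))    # True  -> rule predicted learnable
print(rule_is_productive(100, 30))    # False -> memorization predicted
```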
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: linguistic theories, cognitive modeling, computational psycholinguistics
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 591