Unraveling Syntax: How Language Models Learn Context-Free Grammars

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: language modeling, transformers, context-free grammars (CFG), probabilistic context-free grammars (PCFG), recursion, formal language theory
TL;DR: We initiate the study of the dynamics of language models learning Context-Free Grammars, with theoretical and empirical results showing how loss, learning, and generalization behave with respect to the hierarchical (recursive) structure of grammars.
Abstract: We introduce a new approach to understanding how language models acquire syntax. While large models achieve impressive results, little is known about their learning dynamics. Our approach starts from the observation that most domains of interest, such as natural-language syntax, programming languages, and arithmetic, are captured by context-free grammars (CFGs). In this work we initiate the study of how language modeling for CFGs behaves with respect to the substructure of CFGs, namely the notion of a "subgrammar". We define subgrammars and prove a suite of fundamental results showing that the language-modeling loss obeys recurrences with respect to subgrammars. We show empirically that small transformers learn subgrammars in parallel, unlike children, who first master simple substructures before progressing to more complex constructions. We further explore whether curriculum learning with an inductive bias, pretraining on a subgrammar before the full grammar, can improve performance, and use alignment analysis to show definitively that such pretraining yields internal representations that are more closely aligned with the grammar's substructure. Finally, we demonstrate that models struggle with deeper recursive structures, a limitation shared even by large language models, revealing fundamental challenges in how neural networks represent hierarchical syntax.
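To make the "subgrammar" notion concrete, here is a minimal sketch in Python. The toy grammar and all names are illustrative assumptions, not the paper's definitions: a subgrammar is taken here to be the set of rules reachable from a chosen nonterminal, which can then be sampled on its own.

```python
import random

# Hypothetical toy CFG: each nonterminal (uppercase key) maps to a list of
# alternative right-hand sides; symbols absent from the dict are terminals.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["det", "adj", "noun"]],
    "VP": [["verb"], ["verb", "NP"]],
}

def subgrammar(grammar, start):
    """Return the rules reachable from `start`, i.e. the subgrammar rooted there."""
    reachable, frontier = {}, [start]
    while frontier:
        sym = frontier.pop()
        if sym in grammar and sym not in reachable:
            reachable[sym] = grammar[sym]
            for alternative in grammar[sym]:
                frontier.extend(alternative)
    return reachable

def sample(grammar, symbol, rng):
    """Sample a terminal string by expanding nonterminals left to right."""
    if symbol not in grammar:          # terminal: emit as-is
        return [symbol]
    out = []
    for s in rng.choice(grammar[symbol]):
        out.extend(sample(grammar, s, rng))
    return out
```

Under these assumptions, `subgrammar(GRAMMAR, "NP")` contains only the `NP` rules, while `subgrammar(GRAMMAR, "S")` recovers the full grammar, and `sample` can generate strings from either, mirroring the idea of training on a substructure before the whole grammar.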
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Submission Number: 21086