Physics of Language Models: Part 1, Learning Hierarchical Language Structures

Published: 21 Dec 2025, Last Modified: 21 Dec 2025. Accepted by TMLR (CC BY 4.0).
Abstract: Transformer-based language models are effective but complex, and understanding their inner workings and reasoning mechanisms remains a significant challenge. Previous research has primarily explored how these models handle simple tasks such as name copying or selection; we extend this line of work by investigating how they reason over recursive language structures defined by context-free grammars (CFGs). We introduce a family of synthetic CFGs with hierarchical rules, capable of generating long (e.g., hundreds of tokens), locally ambiguous sentences that require dynamic programming to parse. Despite this complexity, we demonstrate that autoregressive language models such as GPT can accurately learn and reason over these CFG-defined hierarchical languages and generate valid continuations. Analyzing model internals in this controlled setting, we find that hidden states linearly encode the CFG parse structure, and that attention patterns align closely with the information flow of dynamic-programming parsing algorithms. The paper also presents several corollary findings: why absolute positional embeddings are inferior to relative and rotary embeddings; why uniform attention alone is surprisingly effective (motivating our follow-up work on Canon layers); why encoder-only models (e.g., BERT, DeBERTa) struggle with *deep* structural reasoning on CFGs compared to autoregressive models (e.g., GPT); and why injecting structural or syntactic noise into pretraining data markedly improves robustness to corrupted language prompts.
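To make the abstract's setup concrete, below is a minimal sketch of sampling sentences from a small synthetic hierarchical CFG. The grammar, symbol names, and depth cap are hypothetical illustrations chosen for this sketch, not the paper's actual rules (those are in the linked code); parsing the resulting sentences would use a CYK-style dynamic program, which is omitted here.

```python
# Minimal sketch: top-down sampling from a toy hierarchical CFG.
# This grammar is a hypothetical illustration, not the paper's actual grammar.
import random

# Each nonterminal maps to its possible right-hand sides (lists of
# nonterminals or terminal tokens). The rules are recursive, so sampled
# sentences can grow long; "C" shares terminals across its rules, which
# creates the kind of local ambiguity the abstract describes.
CFG = {
    "ROOT": [["A", "B"], ["B", "A"]],
    "A":    [["A", "C"], ["c", "C"]],
    "B":    [["C", "A"], ["b"]],
    "C":    [["a", "b"], ["b", "a"]],
}

def sample(symbol="ROOT", depth=12):
    """Expand `symbol` top-down, choosing rules uniformly at random."""
    if symbol not in CFG:  # terminal token: emit it as-is
        return [symbol]
    rules = CFG[symbol]
    if depth <= 0:
        # Force termination: pick the rule with the fewest nonterminals.
        rule = min(rules, key=lambda r: sum(s in CFG for s in r))
    else:
        rule = random.choice(rules)
    out = []
    for s in rule:
        out.extend(sample(s, depth - 1))
    return out

if __name__ == "__main__":
    print(" ".join(sample()))
```

Running this prints one random sentence per call, e.g. `c a b b` (from ROOT → A B, A → c C, C → a b, B → b); outputs vary with the random seed.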
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In this camera-ready version, I moved a small number of figures and text from the appendix back into the main body and slightly expanded the abstract and conclusion. I also made minor adjustments to the placement of text and figures to improve clarity and visualization.
Video: http://youtu.be/kf_eGgVtOcs
Code: https://physics.allen-zhu.com/part-1
Assigned Action Editor: ~Jonathan_Berant1
Submission Number: 5521