Interpretability of Language Models for Learning Hierarchical Structures

TMLR Paper5521 Authors

01 Aug 2025 (modified: 11 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Transformer-based language models are effective but complex, and understanding their inner workings remains a significant challenge. Previous research has primarily explored how these models handle simple tasks such as name copying or selection; we extend this line of work by investigating how they process complex, recursive language structures defined by context-free grammars (CFGs). We introduce a family of synthetic CFGs with hierarchical rules that generate lengthy, locally ambiguous sequences requiring dynamic programming to parse. Despite this complexity, we show that generative models such as GPT can learn these CFG languages and produce valid completions. Analyzing the model's internals, we find that its hidden states linearly encode parse-tree structure (via a new probing technique we introduce) and that its attention patterns statistically align with the information flow of dynamic-programming-style parsing algorithms. These findings provide a controlled interpretability setting for understanding how transformers may represent and compute over hierarchical syntax.
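To make the setup concrete, below is a minimal sketch of sampling strings top-down from a hierarchical, recursive CFG of the kind the abstract describes. The grammar `TOY_CFG`, the `sample` helper, and the depth cap are hypothetical illustrations under assumed conventions, not the paper's actual grammar family.

```python
import random

# A toy hierarchical CFG in the spirit of the paper's synthetic grammar
# family (this particular grammar is hypothetical, for illustration only).
# Nonterminals map to lists of productions; lowercase symbols are terminals.
TOY_CFG = {
    "S": [["A", "B"], ["B", "A"]],
    "A": [["S", "C"], ["a"]],   # recursive rule: A can re-expand S
    "B": [["C", "A"], ["b"]],
    "C": [["a", "b"], ["b"]],
}

def sample(symbol, rng, depth=0, max_depth=8):
    """Expand `symbol` top-down, picking productions at random.

    Past `max_depth`, fall back to each nonterminal's last production,
    which in this toy grammar bottoms out in terminals within a step.
    """
    if symbol not in TOY_CFG:        # terminal: emit as-is
        return [symbol]
    rules = TOY_CFG[symbol]
    rule = rules[-1] if depth >= max_depth else rng.choice(rules)
    return [t for child in rule for t in sample(child, rng, depth + 1, max_depth)]

rng = random.Random(0)
print("".join(sample("S", rng)))     # a string in the toy grammar's language
```

Because productions share terminals across nonterminals, prefixes of sampled strings are locally ambiguous, so recovering the parse tree requires global, dynamic-programming-style inference rather than greedy left-to-right matching.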
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jonathan_Berant1
Submission Number: 5521