Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically
Keywords: transformers, language models, hierarchical generalization, subnetworks, pruning, inductive biases, Bayesian inference
Abstract: Transformers trained on natural language data have been shown to exhibit hierarchical generalization without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such a preference for hierarchical generalization. We extensively experiment with transformers trained on five synthetic, controlled datasets using several training objectives and show that, while objectives such as sequence-to-sequence modeling and classification often fail to lead to hierarchical generalization, the language modeling objective consistently leads transformers to generalize hierarchically. We then study how different generalization behaviors emerge during training by conducting pruning experiments, which reveal the joint existence of subnetworks within the model implementing different generalizations. Finally, we take a Bayesian perspective to understand transformers' preference for hierarchical generalization: we establish a correlation between whether transformers generalize hierarchically on a dataset and whether the simplest explanation of that dataset is provided by a hierarchical grammar rather than by a regular grammar exhibiting linear generalization.
Overall, our work presents new insights into the origins of hierarchical generalization in transformers and provides a theoretical framework for studying generalization in language models.
Submission Number: 14