Abstract: We present the first systematic analysis of attention heads for syntactic relations in decoder-only Transformer language models. Prior work has demonstrated that encoder-only and encoder-decoder architectures contain attention heads aligned with single-hop syntactic relations, but the internal mechanisms of decoder-only models remain underexplored. Focusing on two representative families (GPT-2 and XGLM) across five model sizes (117M, 345M, 774M, 1.5B, and 1.7B parameters), we identify a novel class of attention heads that capture multi-hop dependency paths (MDPs), e.g., “obl+case”. Through controlled head ablation on the BLiMP benchmark, we show that removing 25\% of MDP heads induces a 7.1\% drop in average grammaticality accuracy, compared to only a 1.6\% drop when ablating the same number of conventional, single-hop syntactic heads. Crucially, this pattern holds consistently across all five model sizes, demonstrating the robustness of our findings. Technically, we (i) extend existing head-identification methods, previously limited to encoder-only and encoder-decoder models, to the decoder-only setting, and (ii) propose a formal definition and detection algorithm for MDP heads. Our results reveal that decoder-only Transformers internalize syntactic information in more complex, non-canonical forms than previously understood, underscoring the importance of cross-chain interactions for grammatical competence.
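To make the head-ablation protocol concrete, the sketch below shows one way to zero out a set of attention heads in a decoder-only model and compare sentence log-probabilities on a BLiMP-style minimal pair. This is a minimal illustration, not the authors' code: it assumes the HuggingFace `transformers` GPT-2 implementation, and the (layer, head) indices are hypothetical placeholders rather than the MDP heads identified in the paper.

```python
# Minimal sketch of attention-head ablation for grammaticality scoring,
# assuming the HuggingFace `transformers` GPT-2 model (117M parameters).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

# Hypothetical (layer, head) pairs standing in for identified MDP heads.
ABLATED_HEADS = [(3, 5), (7, 2), (9, 11)]

# head_mask has shape (num_layers, num_heads); a 0 nullifies that head.
head_mask = torch.ones(model.config.n_layer, model.config.n_head)
for layer, head in ABLATED_HEADS:
    head_mask[layer, head] = 0.0

def sentence_logprob(sentence, mask=None):
    """Sum of token log-probabilities under the (optionally ablated) model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids, head_mask=mask).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()

# BLiMP-style minimal pair: the model is "correct" if it assigns higher
# probability to the grammatical sentence than to the ungrammatical one.
good = "The cats near the door are asleep."
bad = "The cats near the door is asleep."
for mask, tag in [(None, "full"), (head_mask, "ablated")]:
    correct = sentence_logprob(good, mask) > sentence_logprob(bad, mask)
    print(f"{tag}: prefers grammatical sentence = {correct}")
```

Aggregating this preference judgment over all minimal pairs in a BLiMP phenomenon, with and without the mask, yields the accuracy drops reported in the abstract.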
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, explanation faithfulness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3265