Keywords: LLMs, surprisal, syntax, dependency grammar, constituency grammar, N400
TL;DR: Surprisal from language models reflects constituency structure, dependency structure, and semantics to different degrees across architectures.
Abstract: Lexical surprisal, widely used to explain neural responses such as the N400 and the BOLD signal (Michaelov et al., 2024), is often viewed as a measure of lexical prediction. However, it remains unclear how much it reflects syntactic and semantic structure beyond surface-level cues (Slaats & Martin, 2025). This study quantifies how surprisal is shaped by features such as semantic distance, constituency structure, and dependency structure. We estimate word-wise surprisal from nine language models: GPT-2 (Radford et al., 2019), Falcon (Penedo et al., 2023), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), T5 (Raffel et al., 2020), and N-gram models, widely used in psycholinguistic studies. Analyses were conducted on 60,000 English sentences.
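For readers unfamiliar with the measure, word-wise surprisal is the negative log probability of a word given its context, $-\log_2 P(w_t \mid \text{context})$. The sketch below illustrates this for the simplest case, a bigram model. The bigram order, the add-one smoothing, the toy corpus, and the function names are all assumptions made for illustration; the abstract does not specify how the N-gram models were built.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sent in corpus:
        tokens = ["<s>"] + list(sent)          # sentence-start marker
        vocab.update(sent)
        context_counts.update(tokens[:-1])     # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocab


def bigram_surprisal(model, sentence):
    """Return (word, surprisal in bits) pairs under add-one smoothing."""
    context_counts, bigram_counts, vocab = model
    vocab_size = len(vocab) + 1                # one extra slot for unseen words
    tokens = ["<s>"] + list(sentence)
    result = []
    for prev, word in zip(tokens, tokens[1:]):
        prob = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
        result.append((word, -math.log2(prob)))
    return result


# Toy corpus, for illustration only.
toy_corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
toy_model = train_bigram(toy_corpus)
seen = bigram_surprisal(toy_model, ["the", "dog"])
unseen = bigram_surprisal(toy_model, ["the", "barks"])
```

In this toy setting, "dog" after "the" was observed in the corpus and therefore receives lower surprisal than "barks" after "the", which was not.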
Analyses: We fitted separate linear regression models predicting the word-wise surprisal estimates of each of the nine LMs, using either (i) word position and lexical frequency alone (a baseline that yielded an average $R^2$ of $0.446 \pm 0.082$ across models) or (ii) structural (constituency, dependency) and semantic predictors alongside the baseline. We report the gains in explained variance ($\Delta R^2$) from adding the structural and semantic predictors. Constituency predictors included syntactic depth and the number of closed phrases per word; dependency predictors captured the number of dependencies per word and the distance to each word's dependency head. Dependency features yielded the largest $\Delta R^2$ overall, constituency features produced moderate gains, and semantic distance, measured as FastText-based contextual dissimilarity, showed variable effects. GPT-2 and Falcon integrated both dependency and semantic information most strongly. BART showed weaker effects, particularly for constituency. BERT and RoBERTa benefited from syntax, though RoBERTa was less sensitive to semantics. N-gram models relied inconsistently on structure, with modest constituency and semantic effects but relatively strong sensitivity to dependency features, suggesting a shallow structural encoding.
Additionally, we conducted SHAP (SHapley Additive exPlanations) analyses to assess each predictor's contribution to the nine surprisal estimates. SHAP quantifies a feature's impact by averaging its marginal contribution over subsets of the other features. Results show that semantic distance was the most influential predictor for transformer LMs, particularly GPT-2, Falcon, and BART. The number of left dependencies was key for T5 and N-gram LMs, while constituent depth notably impacted RoBERTa. Overall, transformers relied more heavily on semantic predictors, whereas N-gram models showed a more balanced or weaker feature profile.
Hierarchical clustering confirmed distinct grouping patterns between transformer and N-gram architectures based on predictor sensitivities.
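The baseline-versus-full regression comparison described in the Analyses can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the variable names, the single dependency predictor (`head_dist`), the coefficients, and the sample size are all hypothetical. It uses only the standard library so that it runs as written; in practice a statistics package would normally be used for the fits.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X plus an intercept."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)                      # normal equations
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data: surprisal depends on frequency, position, and head distance.
rng = random.Random(0)
n = 500
position = [rng.randint(1, 20) for _ in range(n)]
log_freq = [rng.gauss(0, 1) for _ in range(n)]
head_dist = [rng.randint(1, 8) for _ in range(n)]   # hypothetical dependency feature
surprisal = [
    8.0 - 1.5 * f + 0.05 * p + 0.4 * d + rng.gauss(0, 1)
    for p, f, d in zip(position, log_freq, head_dist)
]

r2_base = r_squared(list(zip(position, log_freq)), surprisal)             # model (i)
r2_full = r_squared(list(zip(position, log_freq, head_dist)), surprisal)  # model (ii)
delta_r2 = r2_full - r2_base
```

Because model (i) is nested in model (ii), `delta_r2` cannot be negative for in-sample fits (up to floating-point error); its size indicates how much variance the added predictor explains beyond position and frequency.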
Conclusions: Although the baseline predictors already explain a large share of the variance, structural predictors explain additional variance in surprisal, confirming that surprisal reflects more than surface-level properties. However, the nature and extent of this sensitivity are not universal; they vary sharply across models, reflecting differences in architecture and training. These findings highlight the importance of evaluating surprisal's informational content in model-specific terms when it is used to interpret cognitive or neural data, and they encourage an empirically driven definition of surprisal, one that reflects how different models relate to both sequential and structural information. In ongoing work, we test whether model-specific sensitivities to structure and semantics shape how surprisal maps onto neural correlates of predictive processing.
Submission Number: 16