Abstract: Developing more data-efficient training approaches depends on a better understanding of inductive biases.
In this work, we hypothesize that the structural information encoded in a transformer's attention matrices is key to acquiring syntax, since attention captures relationships between words, a core component of syntactic structure. Under this hypothesis, we would expect inductive biases targeting attention to selectively improve data efficiency on syntactic benchmarks.
We use knowledge distillation (KD) as a methodological lens to test this hypothesis, comparing conventional KD through output logits against KD through attention matrices.
Using GPT-2 as our teacher model, we train student models on datasets ranging from 10K to 5M sentences and evaluate them on both syntactic benchmarks and general language modeling tasks.
Surprisingly, we find that while logit-based KD substantially improves data efficiency across all metrics, attention-based KD offers minimal benefits even on syntactic tasks. This suggests that logits already provide effective supervision for syntactic information, challenging assumptions about how syntax is represented in transformers and informing more targeted approaches to data-efficient training.
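To make the compared objectives concrete, below is a minimal PyTorch-style sketch of the two distillation losses described in the abstract. The function names, temperature value, use of MSE for attention matching, and the layer/head alignment are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two KD objectives compared in the paper,
# assuming teacher and student both expose output logits and per-layer attention matrices.
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Conventional KD: KL divergence between temperature-softened output distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def attention_kd_loss(student_attns, teacher_attns):
    """Attention KD: match student attention maps to the teacher's.
    Assumes both lists hold tensors of shape (batch, heads, seq, seq) for the aligned layers;
    how student layers are mapped to teacher layers is a separate design choice."""
    return sum(F.mse_loss(s, t) for s, t in zip(student_attns, teacher_attns)) / len(student_attns)
```

In practice the two losses would be weighted against the student's standard language modeling loss; the weighting scheme is another free parameter not specified in the abstract.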
Paper Type: Short
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: cognitive modeling, data-efficient training, distillation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4772