Abstract: Developing more data-efficient training approaches depends on a better understanding of inductive biases.
In this work, we hypothesize that the structural information encoded in a transformer's attention matrices is key to acquiring syntax, since attention captures relationships between words, a core component of syntactic structure. Under this hypothesis, we would expect inductive biases targeting attention to selectively improve data efficiency on syntactic benchmarks.
We use knowledge distillation (KD) as a methodological lens to test this hypothesis, comparing conventional KD through output logits against KD through attention matrices.
Using GPT-2 as our teacher model, we train student models on datasets ranging from 10K to 5M sentences and evaluate them on both syntactic benchmarks and general language modeling tasks.
Surprisingly, we find that while logit-based KD substantially improves data efficiency across all metrics, attention-based KD offers minimal benefits even on syntactic tasks. This suggests that logits already provide effective supervision for syntactic information, challenging assumptions about how syntax is represented in transformers and informing more targeted approaches to data-efficient training.
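To make the compared objectives concrete, below is a minimal PyTorch-style sketch of the two distillation losses described in the abstract. The function names, temperature value, use of MSE for attention matching, and the layer/head alignment are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two KD objectives compared in the paper,
# assuming teacher and student both expose output logits and per-layer attention matrices.
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Conventional KD: KL divergence between temperature-softened output distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def attention_kd_loss(student_attns, teacher_attns):
    """Attention KD: match student attention maps to the teacher's.
    Assumes both lists hold tensors of shape (batch, heads, seq, seq) for the aligned layers;
    how student layers are mapped to teacher layers is a separate design choice."""
    return sum(F.mse_loss(s, t) for s, t in zip(student_attns, teacher_attns)) / len(student_attns)
```

In practice the two losses would be weighted against the student's standard language modeling loss; the weighting scheme is another free parameter not specified in the abstract.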
Paper Type: Short
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: cognitive modeling, data-efficient training, distillation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4772