AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching

ACL ARR 2026 January Submission 2121 Authors

02 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: distillation, small language model
Abstract: Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods face a dilemma: off-policy distillation provides high-quality supervision but suffers from exposure bias (a training–inference mismatch), while on-policy approaches ensure consistency but are limited by the low quality of student-generated outputs. To resolve this dilemma, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation via an adaptive switching mechanism. AdaSwitch lets the student explore its own predictions within its capability and selectively integrates teacher guidance only when the student–teacher divergence exceeds a context-aware threshold. This paradigm preserves generation consistency while ensuring high-quality supervision. Experiments on three datasets demonstrate that AdaSwitch consistently improves accuracy and reasoning capability with moderate overhead.
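
To make the switching idea concrete, below is a minimal sketch of how such an adaptive switch could work at the token level. It assumes HuggingFace-style causal LMs, uses KL divergence between teacher and student next-token distributions as the divergence signal, an entropy-scaled threshold as one plausible "context-aware" choice, and greedy decoding for simplicity; the names adaswitch_generate and base_threshold are hypothetical, and the paper's actual mechanism may differ.

```python
import torch
import torch.nn.functional as F

def adaswitch_generate(student, teacher, input_ids, max_new_tokens=64, base_threshold=2.0):
    """Illustrative token-level switch between on-policy (student) and
    off-policy (teacher) generation. Assumes batch size 1 for clarity."""
    ids = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            s_logits = student(ids).logits[:, -1, :]  # student next-token logits
            t_logits = teacher(ids).logits[:, -1, :]  # teacher next-token logits
        s_logp = F.log_softmax(s_logits, dim=-1)
        t_logp = F.log_softmax(t_logits, dim=-1)
        # KL(teacher || student) as the divergence signal (one possible choice)
        kl = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
        # Placeholder "context-aware" threshold: loosen when the teacher itself
        # is uncertain (high entropy), so the student is allowed to explore more
        entropy = -(t_logp.exp() * t_logp).sum(-1).mean()
        threshold = base_threshold * (1.0 + 0.1 * entropy)
        if kl > threshold:
            next_id = t_logp.argmax(-1, keepdim=True)  # switch to teacher guidance
        else:
            next_id = s_logp.argmax(-1, keepdim=True)  # keep the student's own token
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```

The design intent mirrored here is that the trajectory stays on-policy (student-generated, avoiding exposure bias) whenever the student is within its capability, and only falls back to the teacher's higher-quality supervision when the two distributions diverge beyond the threshold.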
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: distillation
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 2121