Keywords: Neural document ranking, knowledge distillation
Abstract: Knowledge distillation is useful for training a neural document ranking model, employing a teacher to guide the student model's refinement.
Because a teacher may not perform well in all cases, over-calibration between the student and teacher models can make training less effective.
This paper studies a generalized KL divergence loss in a weighted form for refining ranking models in text document search,
and examines its formal properties in balancing knowledge distillation in adaptation to the relative performance of the teacher
and student models. This loss differentiates the roles of positive and negative documents for a training query, and
allows a student model to take a conservative or divergent approach in imitating the teacher's behavior when
the teacher model performs worse than the student model. This paper presents a detailed theoretical analysis, together with experiments, on the behavior and usefulness of this generalized loss.
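The abstract does not give the loss's exact definition; as a rough illustration of the general idea, the following is a minimal sketch of one plausible form of a per-document weighted KL divergence over a query's candidate documents. The function names, the softmax parameterization of teacher/student score distributions, and the specific weight values are illustrative assumptions, not the paper's formulation.

```python
import math

def softmax(scores):
    """Convert raw ranking scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_kl(teacher_scores, student_scores, weights):
    """Weighted KL divergence sum_d w_d * p_d * log(p_d / q_d), where p is the
    teacher's distribution and q the student's distribution over candidate
    documents for one query. The per-document weights w_d (an assumed
    mechanism here) can down-weight the teacher's signal, e.g. on negative
    documents or where the teacher is judged less reliable than the student.
    With all weights equal to 1 this reduces to the standard KL divergence."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(w * pi * math.log(pi / qi)
               for w, pi, qi in zip(weights, p, q))

# Example: one positive document followed by three negatives; the positive
# keeps full weight while the negatives are down-weighted (values arbitrary).
teacher = [4.0, 1.0, 0.5, 0.2]
student = [3.0, 1.5, 0.8, 0.1]
loss = weighted_kl(teacher, student, [1.0, 0.3, 0.3, 0.3])
```

With uniform weights the loss is the usual distillation objective; non-uniform weights let training loosen the match to the teacher selectively, which is the balancing behavior the abstract describes.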
Submission Number: 31