Keywords: LLM, Minimum Bayes Risk, Knowledge Distillation, Machine Translation
TL;DR: Improve knowledge distillation by integrating multiple high-scoring MBR synthetic translations from the teacher model into training, yielding performance gains for student models
Abstract: A critical component in knowledge distillation is the means of coupling the teacher and student. The leading sequence knowledge distillation method involves supervised learning of the student against teacher-decoded synthetic outputs, and is exemplified by the current state of the art, which incorporates minimum Bayes risk (MBR) teacher decoding. In this paper we seek to integrate MBR more tightly into distillation training, specifically by using several high-scoring MBR translations, rather than a single selected sequence, thus capturing a rich diversity of teacher outputs. This approach enhances the diversity of synthetic training data without requiring further human input or labeled data. Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods for both tasks and with varying model sizes. Additionally, we conduct a detailed analysis focusing on data efficiency and the capacity curse to elucidate MBR-n and explore its further potential.
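The abstract's core idea is to keep the n highest-scoring MBR candidates from the teacher, rather than only the single best, as synthetic training pairs for the student. The following is a minimal sketch of that selection step under stated assumptions: the candidate translations are assumed to be already sampled from the teacher, and `utility` is a hypothetical placeholder for a real utility metric such as chrF or COMET, which the paper's actual setup may differ from.

```python
def utility(hypothesis: str, reference: str) -> float:
    """Placeholder utility: token-overlap F1 (a stand-in for chrF/COMET)."""
    hyp, ref = hypothesis.split(), reference.split()
    common = len(set(hyp) & set(ref))
    if not hyp or not ref or common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_top_n(candidates: list[str], n: int) -> list[str]:
    """Score each candidate by its expected utility against the other
    candidates (standard MBR scoring) and keep the n highest-scoring ones."""
    scores = [
        sum(utility(c, r) for r in candidates if r is not c) / (len(candidates) - 1)
        for c in candidates
    ]
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:n]]


# Each source sentence then contributes n (source, translation) pairs to the
# student's supervised training data, instead of the single best pair used
# in standard MBR-based sequence knowledge distillation.
source = "The cat sat on the mat."
teacher_samples = [
    "Die Katze saß auf der Matte.",
    "Die Katze sitzt auf der Matte.",
    "Eine Katze saß auf einer Matte.",
]
training_pairs = [(source, t) for t in mbr_top_n(teacher_samples, n=2)]
```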
Submission Number: 22