Abstract: In the context of conversational commerce, where training data may be limited and low latency is critical, we demonstrate that knowledge distillation can be used not only to reduce model size but also to simultaneously adapt a contextual language model to a specific domain. We use Multilingual BERT (mBERT; Devlin et al., 2019) as a starting point and follow the knowledge distillation approach of Sanh et al. (2019) to train a smaller multilingual BERT model that is adapted to the domain at hand.
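To make the distillation setup concrete, the following is a minimal sketch of the kind of triple-loss objective used by Sanh et al. (2019), combining a temperature-scaled soft-target loss, a masked-LM loss, and a cosine loss on hidden states. The temperature `T` and the loss weights here are illustrative placeholders, not values reported in the paper, and the function name is hypothetical.

```python
# Illustrative sketch (not the authors' code) of a Sanh et al. (2019)-style
# distillation objective. T and the alpha weights are assumed hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      T=2.0, alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    # Soft-target loss: KL divergence between temperature-scaled distributions.
    ce = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Standard masked-LM loss on hard labels (-100 marks unmasked positions).
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Cosine loss aligning student and teacher hidden states token by token.
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )
    return alpha_ce * ce + alpha_mlm * mlm + alpha_cos * cos
```

Running this loss over in-domain text, rather than domain-general text, is what yields the domain-adapted student model described below.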
We show that for in-domain tasks, the domain-specific model shows, on average, a 2.3% improvement in F1 score relative to a model distilled on domain-general data. Whereas much previous work with BERT has fine-tuned the encoder weights during task training, we show that the model improvements from distillation on in-domain data persist even when the encoder weights are frozen during task training, allowing a single encoder to support classifiers for multiple tasks and languages.
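The frozen-encoder setup can be sketched as follows. This is not the authors' code: the checkpoint name stands in for the distilled domain-specific encoder, and the head names and label counts ("intent_en", "intent_de", 12 classes) are hypothetical examples of per-task, per-language classifiers sharing one fixed encoder.

```python
# Illustrative sketch: freeze a shared encoder and train only lightweight
# per-task/per-language heads on top of it.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
for p in encoder.parameters():
    p.requires_grad = False  # encoder stays fixed; only the heads are trained

# One classifier head per task/language, all sharing the frozen encoder.
heads = nn.ModuleDict({
    "intent_en": nn.Linear(encoder.config.hidden_size, 12),
    "intent_de": nn.Linear(encoder.config.hidden_size, 12),
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tokenizer(["where is my order?"], return_tensors="pt")
with torch.no_grad():  # no gradients flow through the frozen encoder
    cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
logits = heads["intent_en"](cls)  # only this head has trainable parameters
```

Because the encoder's weights never change during task training, the same forward pass (and even cached [CLS] vectors) can serve every head, which is what makes a single encoder practical for multiple tasks and languages at low latency.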