Abstract: In the context of conversational commerce, where training data may be limited and low latency is critical, we demonstrate that knowledge distillation can be used not only to reduce model size but also to simultaneously adapt a contextual language model to a specific domain. We use Multilingual BERT (mBERT; Devlin et al., 2019) as a starting point and follow the knowledge distillation approach of Sanh et al. (2019) to train a smaller multilingual BERT model that is adapted to the domain at hand.
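To make the distillation setup concrete, the following is a minimal sketch of the kind of triple-loss objective used by Sanh et al. (2019), combining a temperature-scaled soft-target loss, a masked-LM loss, and a cosine loss on hidden states. The temperature `T` and the loss weights here are illustrative placeholders, not values reported in the paper, and the function name is hypothetical.

```python
# Illustrative sketch (not the authors' code) of a Sanh et al. (2019)-style
# distillation objective. T and the alpha weights are assumed hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      T=2.0, alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    # Soft-target loss: KL divergence between temperature-scaled distributions.
    ce = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Standard masked-LM loss on hard labels (-100 marks unmasked positions).
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Cosine loss aligning student and teacher hidden states token by token.
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )
    return alpha_ce * ce + alpha_mlm * mlm + alpha_cos * cos
```

Running this loss over in-domain text, rather than domain-general text, is what yields the domain-adapted student model described below.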
We show that for in-domain tasks, the domain-specific model shows, on average, a 2.3% improvement in F1 score relative to a model distilled on domain-general data. Whereas much previous work with BERT has fine-tuned the encoder weights during task training, we show that the model improvements from distillation on in-domain data persist even when the encoder weights are frozen during task training, allowing a single encoder to support classifiers for multiple tasks and languages.
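The frozen-encoder setup can be sketched as follows. This is not the authors' code: the checkpoint name stands in for the distilled domain-specific encoder, and the head names and label counts ("intent_en", "intent_de", 12 classes) are hypothetical examples of per-task, per-language classifiers sharing one fixed encoder.

```python
# Illustrative sketch: freeze a shared encoder and train only lightweight
# per-task/per-language heads on top of it.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
for p in encoder.parameters():
    p.requires_grad = False  # encoder stays fixed; only the heads are trained

# One classifier head per task/language, all sharing the frozen encoder.
heads = nn.ModuleDict({
    "intent_en": nn.Linear(encoder.config.hidden_size, 12),
    "intent_de": nn.Linear(encoder.config.hidden_size, 12),
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tokenizer(["where is my order?"], return_tensors="pt")
with torch.no_grad():  # no gradients flow through the frozen encoder
    cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
logits = heads["intent_en"](cls)  # only this head has trainable parameters
```

Because the encoder's weights never change during task training, the same forward pass (and even cached [CLS] vectors) can serve every head, which is what makes a single encoder practical for multiple tasks and languages at low latency.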