- TL;DR: Knowledge distillation for cross-lingual language model alignment with state-of-the-art results on XNLI
- Abstract: Current state-of-the-art results in multilingual natural language inference (NLI) are based on tuning XLM (a pre-trained polyglot language model) separately for each language involved, resulting in multiple models. We achieve significantly higher NLI results with a single model for all languages via multilingual tuning. Furthermore, we introduce cross-lingual knowledge distillation (XD), where the same polyglot model is used both as teacher and student across languages to improve its sentence representations without using the end-task labels. On its own, XD beats multilingual tuning for some languages, and combining the two yields a new state of the art of 79.2% on the XNLI dataset, surpassing the previous result by an absolute 2.5%. The models and code for reproducing our experiments will be made publicly available after de-anonymization.
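The self-distillation idea in the abstract — one polyglot encoder acting as its own frozen teacher on parallel sentences, with no NLI labels — can be sketched as a simple alignment loss. The names and toy embeddings below are illustrative assumptions, not the paper's implementation; here the objective is a plain mean-squared error pulling the student's embedding of a translation toward the teacher's embedding of the source sentence.

```python
import numpy as np

def xd_loss(teacher_emb: np.ndarray, student_emb: np.ndarray) -> float:
    """Mean-squared error between the frozen teacher's embedding of a
    source sentence and the student's embedding of its translation.
    In training, gradients would flow only to the student copy."""
    diff = teacher_emb - student_emb
    return float(np.mean(diff ** 2))

# Toy 4-dim sentence embeddings (illustrative, not from the model).
en_emb = np.array([0.2, -0.1, 0.4, 0.0])   # teacher: English sentence
fr_emb = np.array([0.1, -0.1, 0.5, 0.1])   # student: French translation

loss = xd_loss(en_emb, fr_emb)  # small positive value; zero iff aligned
```

Since both roles are played by the same pre-trained model, the loss is zero at initialization for identical inputs and only penalizes cross-lingual divergence of the sentence representations.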
- Code: https://drive.google.com/open?id=1QZ9VnQYWRPtNtdyvep4cF6xI2kIi_3El
- Keywords: cross-lingual transfer, sentence embeddings, polyglot language models, knowledge distillation, natural language inference, embedding alignment, embedding mapping