- TL;DR: Knowledge distillation for cross-lingual language model alignment with state-of-the-art results on XNLI
- Abstract: Current state-of-the-art results in multilingual natural language inference (NLI) are based on tuning XLM (a pre-trained polyglot language model) separately for each language involved, resulting in multiple models. We achieve significantly higher NLI results with a single model for all languages via multilingual tuning. Furthermore, we introduce cross-lingual knowledge distillation (XD), where the same polyglot model is used both as teacher and student across languages to improve its sentence representations without using the end-task labels. On its own, XD beats multilingual tuning for some languages, and combining the two yields a new state of the art of 79.2% on the XNLI dataset, surpassing the previous result by an absolute 2.5%. The models and code for reproducing our experiments will be made publicly available after de-anonymization.
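The self-distillation idea in the abstract — one polyglot encoder acting as its own frozen teacher on parallel sentences, with no NLI labels — can be sketched as a simple alignment loss. The names and toy embeddings below are illustrative assumptions, not the paper's implementation; here the objective is a plain mean-squared error pulling the student's embedding of a translation toward the teacher's embedding of the source sentence.

```python
import numpy as np

def xd_loss(teacher_emb: np.ndarray, student_emb: np.ndarray) -> float:
    """Mean-squared error between the frozen teacher's embedding of a
    source sentence and the student's embedding of its translation.
    In training, gradients would flow only to the student copy."""
    diff = teacher_emb - student_emb
    return float(np.mean(diff ** 2))

# Toy 4-dim sentence embeddings (illustrative, not from the model).
en_emb = np.array([0.2, -0.1, 0.4, 0.0])   # teacher: English sentence
fr_emb = np.array([0.1, -0.1, 0.5, 0.1])   # student: French translation

loss = xd_loss(en_emb, fr_emb)  # small positive value; zero iff aligned
```

Since both roles are played by the same pre-trained model, the loss is zero at initialization for identical inputs and only penalizes cross-lingual divergence of the sentence representations.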
- Code: https://drive.google.com/open?id=1QZ9VnQYWRPtNtdyvep4cF6xI2kIi_3El
- Keywords: cross-lingual transfer, sentence embeddings, polyglot language models, knowledge distillation, natural language inference, embedding alignment, embedding mapping