- TL;DR: Semi-supervised Cross-lingual Document Classification
- Abstract: The scarcity of labeled training data often prohibits the internationalization of NLP models to multiple languages. Cross-lingual understanding has made progress in this area using language-universal representations. However, most current approaches treat the problem as one of aligning languages and do not address the natural domain drift across languages and cultures. In this paper, we address the domain gap in the setting of semi-supervised cross-lingual document classification, where labeled data is available in a source language and only unlabeled data is available in the target language. We combine a state-of-the-art unsupervised learning method, masked language modeling pre-training, with a recent method for semi-supervised learning, Unsupervised Data Augmentation (UDA), to simultaneously close the language and the domain gap. We show that addressing the domain gap in cross-lingual tasks is crucial. We improve over strong baselines and achieve a new state of the art for cross-lingual document classification.
- Keywords: cross-lingual, document classification, semi-supervised learning
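
The abstract's semi-supervised setup can be illustrated with a minimal sketch of a UDA-style objective: a supervised cross-entropy term on labeled source-language documents plus a consistency term that penalizes disagreement between predictions on an unlabeled target-language document and on an augmented copy of it. This is not the paper's implementation. The function names, the `lam` weight, and the use of raw logits in place of a pre-trained encoder are assumptions made for illustration, and the augmentation step (for example back-translation) is assumed to happen upstream.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-likelihood of the gold label under softmax(logits)."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def uda_loss(labeled_logits, labels, unlabeled_logits, augmented_logits, lam=1.0):
    """Hypothetical UDA-style loss (illustrative names, not the paper's code).

    labeled_logits:   model outputs for labeled source-language documents
    labels:           gold class indices for those documents
    unlabeled_logits: model outputs for unlabeled target-language documents
    augmented_logits: model outputs for augmented copies of the same documents
    lam:              assumed weight on the consistency term
    """
    # Supervised term: mean cross-entropy on the labeled source data.
    supervised = sum(
        cross_entropy(z, y) for z, y in zip(labeled_logits, labels)
    ) / len(labels)

    # Consistency term: mean KL between the prediction on the original
    # document (treated as a fixed target) and on its augmented copy.
    consistency = sum(
        kl_divergence(softmax(z), softmax(z_aug))
        for z, z_aug in zip(unlabeled_logits, augmented_logits)
    ) / len(unlabeled_logits)

    return supervised + lam * consistency
```

In a real training loop the logits would come from a classifier built on the masked-language-model encoder, and the prediction on the original document would be detached from the gradient so that only the augmented branch is pushed toward it.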