Bridging the domain gap in cross-lingual document classification

25 Sept 2019 (modified: 22 Oct 2023) · ICLR 2020 Conference Withdrawn Submission
TL;DR: Semi-supervised Cross-lingual Document Classification
Abstract: The scarcity of labeled training data often prevents NLP models from being extended to multiple languages. Cross-lingual understanding has made progress in this area using language-universal representations. However, most current approaches treat the problem purely as one of language alignment and do not address the natural domain drift across languages and cultures. In this paper, we address the domain gap in the setting of semi-supervised cross-lingual document classification, where labeled data is available in a source language and only unlabeled data is available in the target language. We combine a state-of-the-art unsupervised learning method, masked language modeling pre-training, with a recent method for semi-supervised learning, Unsupervised Data Augmentation (UDA), to simultaneously close both the language gap and the domain gap. We show that addressing the domain gap is crucial in cross-lingual tasks. We improve over strong baselines and achieve a new state of the art for cross-lingual document classification.
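For readers unfamiliar with the UDA objective the abstract refers to, below is a minimal PyTorch sketch of consistency training in the UDA style: a supervised cross-entropy term on labeled source-language data plus a KL consistency term that pushes predictions on an augmented unlabeled example toward the predictions on the original. All names here (`model`, `lam`, the batch arguments) are illustrative assumptions, not taken from the paper's code, and details such as the augmentation method and confidence thresholding are omitted.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, labeled_batch, labels, unlabeled_batch, augmented_batch, lam=1.0):
    """UDA-style semi-supervised loss (illustrative sketch).

    Combines supervised cross-entropy on labeled source-language data
    with a consistency (KL) term on unlabeled target-language data.
    """
    # Supervised term: standard cross-entropy on the labeled batch.
    sup_logits = model(labeled_batch)
    sup_loss = F.cross_entropy(sup_logits, labels)

    # Consistency term: predictions on the original unlabeled text serve
    # as a fixed target (no gradient flows through them); the model is
    # trained to match them on an augmented version of the same text.
    with torch.no_grad():
        target_probs = F.softmax(model(unlabeled_batch), dim=-1)
    aug_log_probs = F.log_softmax(model(augmented_batch), dim=-1)
    consistency = F.kl_div(aug_log_probs, target_probs, reduction="batchmean")

    return sup_loss + lam * consistency
```

In this setup, `lam` weights the unsupervised term; the sketch assumes `model` maps a batch of encoded documents to class logits.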
Keywords: cross-lingual, document classification, semi-supervised learning
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:1909.07009/code)