Bridging the domain gap in cross-lingual document classification

Guokun Lai; Barlas Oguz; Yiming Yang; Veselin Stoyanov

25 Sept 2019 (modified: 22 Jun 2025); ICLR 2020 Conference Withdrawn Submission; Readers: Everyone

TL;DR: Semi-supervised Cross-lingual Document Classification

Abstract: The scarcity of labeled training data often prevents NLP models from being extended to multiple languages. Cross-lingual understanding has made progress in this area using language-universal representations. However, most current approaches treat the problem as one of aligning languages and do not address the natural domain drift across languages and cultures. In this paper, we address the domain gap in the setting of semi-supervised cross-lingual document classification, where labeled data is available in a source language and only unlabeled data is available in the target language. We combine a state-of-the-art unsupervised learning method, masked language modeling pre-training, with a recent method for semi-supervised learning, Unsupervised Data Augmentation (UDA), to simultaneously close the language gap and the domain gap. We show that addressing the domain gap in cross-lingual tasks is crucial. We improve over strong baselines and achieve a new state of the art for cross-lingual document classification.

Keywords: cross-lingual, document classification, semi-supervised learning
