Keywords: NLP, Pretraining Objectives, Spoken Language, Multilingual
Abstract: There has been increasing interest among NLP researchers in learning generic representations. However, in the field of multilingual spoken dialogue systems, this problem remains overlooked: most pre-training methods focus on learning representations for written, non-conversational data or are restricted to the monolingual setting. In this work we (1) generalise existing losses to the multilingual setting and (2) develop a new set of losses to leverage parallel conversations when available. These losses improve the learning of representations by encouraging the deep encoder to better capture contextual dependencies. The pre-training relies on \texttt{OpenSubtitles}, a large multilingual corpus of $24.3$G tokens; a by-product of the pre-processing is a set of aligned multilingual conversations. We also introduce two new multilingual tasks and a new benchmark on multilingual dialogue act labels called \texttt{MIAM}. We validate our pre-training on the three aforementioned tasks and show that our model, trained with the newly designed losses, achieves better performance than existing models. Our implementation will be available on \url{github.com} and the preprocessed data will be available in Datasets.
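The abstract does not specify the form of the losses that exploit parallel conversations, so the following is only a minimal, hypothetical sketch of one plausible instantiation: an InfoNCE-style contrastive objective that pulls together encoder representations of aligned utterances from parallel conversations. The function name, tensor shapes, and temperature parameter are illustrative assumptions, not the authors' actual formulation.

```python
# Hypothetical sketch (not the paper's actual loss): a contrastive alignment
# objective over encoder representations of parallel (aligned) utterances.
import torch
import torch.nn.functional as F

def parallel_alignment_loss(src_repr: torch.Tensor,
                            tgt_repr: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """src_repr, tgt_repr: (batch, dim) representations of aligned utterances
    from two languages; row i of each tensor comes from the same conversation turn."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.T / temperature                 # pairwise similarities
    targets = torch.arange(src.size(0), device=src.device)
    # Symmetric cross-entropy: each utterance must retrieve its parallel counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```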