Leyzer: A Dataset for Multilingual Virtual Assistants

Published: 2020, Last Modified: 27 Jun 2023, TDS 2020, Readers: Everyone
Abstract: In this article, we present the Leyzer dataset, a multilingual text corpus designed for studying multilingual and cross-lingual natural language understanding (NLU) models and strategies for localizing virtual assistants. The corpus covers 20 domains and 186 intents in three languages (English, Spanish, and Polish), with the number of samples ranging from 1 to 672 sentences per intent. We describe the data generation process, including the creation of grammars and forced parallelization, and present a detailed analysis of the resulting corpus. Finally, we report results for two localization strategies, train-on-target and zero-shot learning, using multilingual BERT models.