Towards Linguistically Robust NLG Systems for Localized Virtual Assistants

Anonymous

05 Jun 2022 (modified: 05 May 2023) | ACL ARR 2022 June Blind Submission | Readers: Everyone

Keywords: NLG, natural language generation, localization, internationalization, multilingual, linguistic, dialog, dialogue, virtual assistants, dataset, evaluation

Abstract: One of the biggest challenges in localizing the natural language generation of virtual assistants such as Alexa, the Google Assistant, or Siri to many languages is the proper handling of entities. Neural machine translation systems may translate entities literally or introduce grammatical mistakes by using the wrong inflections. The diversity of linguistic phenomena for entities across languages is vast, yet ensuring grammatical correctness for a broad range of entities is critical: native speakers may find entity-related grammatical errors silly, jarring, or even offensive. To assess linguistic robustness, we create a multilingual corpus of linguistically significant entities annotated by expert linguists. We also share a simple algorithm for leveraging this corpus to produce linguistically diverse training and evaluation datasets. Using the Schema-Guided Dialog Dataset (DSTC8) as a test bed, we collect human translations for a subset of linguistically boosted examples to establish quality baselines for neural, template-based, and hybrid NLG systems in French (high-resource), Marathi (low-resource), and Russian (highly inflected). We make our corpus and the derived translation-based datasets available for further research.

Paper Type: long
