Avoiding 'generatese': the optimization of NLG systems through fit-for-purpose data collections

University of Eastern Finland DRDHum 2024 Conference Submission 14 Authors

Published: 03 Jun 2024, Last Modified: 16 Aug 2024, Venue: DRDHum 2024, License: CC BY 4.0
Keywords: NLG systems, data collections, translation, generatese
TL;DR: This paper explores the creation of small but highly fit-for-purpose data collections that can be used to minimize 'generatese' in translations obtained by NLG systems.
Abstract: Most linguistic research on the use and exploitation of Natural Language Generation (NLG) systems, whether through graphical interfaces (as in the case of ChatGPT or Gemini) or without them, has primarily focused on their ability to generate text on the basis of prompts. These systems have a wide range of applications, one of which is the interlingual translation of text. They are also able to generate text from a prompt, either in response to a question or a request to perform a linguistic task. Their apparent ability to generate coherent text from another text surpasses the functionalities of any previous linguistic resource. A translated text often retains certain traces of the source text and language, a phenomenon known as "translationese" (Baker, 1993). With the widespread adoption of machine translation, especially in certain genres, there has been an observable intensification of this phenomenon, which has been termed "post-editese" (Toral, 2019). This can be detected through measurements of specific linguistic aspects and comparisons of human and machine translations using parallel and reference corpora. Recently, AI systems known as Large Language Models (LLMs) have begun to be used in both professional translation and translator training. The potential footprint such systems leave on translated texts could be called ‘generatese’ (Sánchez-Gijón, 2024). The principle of language agnosticism that underlies NLG systems can affect not only the form of discourse (the linguistic features of a text) but also its content (the concepts and ideas it contains and how they are developed) (Sánchez-Gijón, 2022; Imran et al., 2023). This paper aims to study the impact of using small, highly fit-for-purpose data collections to optimize NLG systems by reducing the randomness of their responses and mitigating ‘generatese’.
We will explore the creation and, in particular, the description of such data collections, along with their potential for enhancing the quality of translations produced by NLG systems.
Submission Number: 14