Persona-aware Generative Model for Code-mixed Language


TMLR Paper2121 Authors

30 Jan 2024 (modified: 17 Sept 2024). Rejected by TMLR. CC BY 4.0

Abstract: Code-mixing and script-mixing are prevalent across online social networks and multilingual societies. However, a user's preference toward code-mixing depends on the user's socioeconomic status, demographics, and local context, which existing generative models tend to ignore when generating code-mixed texts. In this work, we make a pioneering attempt to develop a persona-aware generative model that generates texts resembling the real-life code-mixed texts of individuals. We propose PARADOX, a novel Transformer-based encoder-decoder model for persona-aware code-mixed text generation, which encodes an utterance conditioned on a user's persona and generates code-mixed texts without monolingual reference data. We also propose an alignment module that re-calibrates the generated sequence to resemble real-life code-mixed texts. PARADOX generates code-mixed texts that are semantically more meaningful and linguistically more valid. To evaluate the personification capabilities of PARADOX, we propose four new metrics: CM BLEU, CM Rouge-1, CM Rouge-L and CM KS. On average, PARADOX achieves $1.6$ points higher CM BLEU, $47\%$ better perplexity and $32\%$ better semantic coherence than its non-persona-based counterparts.

Submission Length: Long submission (more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=OdGz6RzEWl

Changes Since Last Submission: The previous submission was desk rejected because it used an incorrect style file. We have updated the submission to use the correct TMLR format.

Assigned Action Editor: ~Laurent_Charlin1

Submission Number: 2121
