Persona-aware Generative Model for Code-mixed Language

TMLR Paper2972 Authors

07 Jul 2024 (modified: 17 Sept 2024) · Under review for TMLR · CC BY 4.0
Abstract: Code-mixing and script-mixing are prevalent across online social networks and multilingual societies. However, a user's preference toward code-mixing depends on the user's socioeconomic status, demographics, and local context, factors that existing generative models tend to ignore when generating code-mixed texts. In this work, we make a pioneering attempt to develop a persona-aware generative model that generates texts resembling the real-life code-mixed texts of individuals. We propose PARADOX, a persona-aware generative model for code-mixed text generation: a novel Transformer-based encoder-decoder model that encodes an utterance conditioned on a user's persona and generates code-mixed texts without monolingual reference data. We also propose an alignment module that re-calibrates the generated sequence to resemble real-life code-mixed texts. PARADOX generates code-mixed texts that are semantically more meaningful and linguistically more valid. To evaluate the personification capabilities of PARADOX, we propose four new metrics -- CM BLEU, CM Rouge-1, CM Rouge-L and CM KS. On average, PARADOX achieves $1.6$% better CM BLEU, $57$% better perplexity and $32$% better semantic coherence than non-persona-based counterparts.
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=FSwqR0NKmz
Changes Since Last Submission: We thank the action editor and reviewers for their constructive comments. Below, we address the comments made by the action editor on our previous TMLR submission.

**Comment 1: User ID.** The need to condition on a user ID (instead of, say, the previous utterances of a user) limits the method to "warm-start" users. It's possible that it would be enough to recognize this limitation early in the paper.

**Response.** As argued in Section 3, an explicit user ID allows us to encode the temporal evolution of a user's persona. We also elaborate on how to use the unique identifier in "warm-start" scenarios. The ablation studies in Tables 2 and 3 also highlight the effectiveness of the unique user ID for personalized code-mixed generation. For instance, modeling the user persona with the explicit user ID improves the validation perplexity by $48$%.

**Changes to manuscript:** Updated Section 3.2.1 and Section 5.

**Comment 2: Contextual persona module.** The reviewers found the parametrization of this module, and, in particular, the latent space modelling, to require additional justification given the claims made in the paper (notably that it "capture[s] the contextual perturbations in the user persona"). The reviewers recognized that the module was shown to provide generation improvements (Table 3) but wondered how critical the proposed approach was in obtaining these results. For example, one reviewer suggested running an ablation where a simple neural network replaces the latent modelling part to start exploring this question.

**Response.** In Section 3.2.2, we argue that contextualizing the user persona helps our model capture the contextual perturbations in the user's persona. Encouraged by the action editor's suggestion, we evaluate an ablation of PARADOX, denoted **linear persona module**, in which the variational persona module is replaced with a linear mapping.
Notably, the results in Figure 3 suggest that the randomized (variational) persona module achieves better perplexity, CM BLEU and CM Rouge scores than its linear counterpart and stabilizes the generative model.

**Changes to manuscript:** Updated Section 3.2.2 and Section 5. Added Figure 3.

**Comment 3: CM evaluation metrics.** The claim that the proposed CM-based evaluation metrics evaluate personalization is dissonant with what the methods seem to capture. In particular, the proposed evaluation metrics seem to evaluate a restricted version of personalization that doesn't consider which tokens are code-mixed. Perhaps clarifying the definition of personalization would help align these ideas.

**Response.** We define "personalized code-mixing" in Section 3.1 as a behavioral language that captures a user's writing style given the historical trends, social context, targeted demographics, and topical relevance. Our proposed metrics -- CM BLEU, CM Rouge, and CM KS -- calculate the similarity between the linguistic patterns of model-generated texts and the users' historical utterances. By doing so, these metrics capture different aspects of the personalized usage of code-mixing for different users.

**Changes to manuscript:** Modified Section 2. Added Section 3.1. Updated Section 4.2.

**Other major changes.**
* Elaborated on the code-mixed generation algorithm in Section 3.4.
* Added a discussion on how our methodology can be extended to other code-mixed language pairs in Section 7.
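To make the intuition behind distribution-level comparison concrete, the following is a minimal, self-contained sketch of a KS-style comparison between the code-mixing ratios of generated and historical utterances. This is only an illustration of the *spirit* of a metric like CM KS; the paper's exact definition may differ, and all identifiers (`mix_ratio`, `ks_statistic`, the toy vocabulary) are hypothetical.

```python
# Hypothetical illustration (not the paper's definition): compare the
# distribution of per-utterance code-mixing ratios of generated texts
# against a user's historical utterances via a two-sample KS statistic.

def mix_ratio(tokens, matrix_vocab):
    """Fraction of tokens outside the matrix-language vocabulary."""
    if not tokens:
        return 0.0
    return sum(t not in matrix_vocab for t in tokens) / len(tokens)

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    grid = sorted(set(xs) | set(ys))

    def cdf(sample, v):
        return sum(s <= v for s in sample) / len(sample)

    return max(abs(cdf(xs, v) - cdf(ys, v)) for v in grid)

# Toy Hindi-English example with English as the matrix language.
ENGLISH = {"the", "is", "good", "movie", "very"}
history = [["movie", "bahut", "good", "hai"], ["the", "movie", "is", "good"]]
generated = [["movie", "bahut", "accha", "hai"], ["very", "good", "movie"]]

h = [mix_ratio(u, ENGLISH) for u in history]
g = [mix_ratio(u, ENGLISH) for u in generated]
# Lower statistic = generated mixing behavior closer to the user's history.
print(ks_statistic(h, g))
```

A lower statistic indicates that the generated texts mix languages at rates similar to the user's own history, which is the kind of user-level stylistic fidelity the response above describes.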
Assigned Action Editor: ~Laurent_Charlin1
Submission Number: 2972