AlignedAug: Alignment of naturalness and coherence score distributions for real dialogue augmentation

ACL ARR 2026 January Submission 8007 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Dialogue Augmentation, Large Language Models (LLMs), Data Distribution Alignment, Multi-party Chat, Utterance Classification, Response Selection, Summarization
Abstract: Synthetic dialogues generated by Large Language Models (LLMs) differ from real dialogues in linguistic attributes such as naturalness and sentence completeness. To bridge this gap, we propose AlignedAug, a framework for realistic dialogue augmentation. AlignedAug consists of two stages: (1) Cognition-aware Dialogue Generation, which produces utterances with an LLM-based model; (2) Distribution Alignment, which is composed of Chat Style Refinement and Statistical Selection. Chat Style Refinement simulates informal, chat-like responses by randomly deleting words other than subjects, verbs, and negations. Statistical Selection retains dialogues whose naturalness and coherence scores are aligned with those of real dialogues. Experimental results show that the Distribution Alignment stage in AlignedAug reduces the gap between real and synthetic dialogues (CollabChat $S_{KS}$: 0.95 $\rightarrow$ 0.35). In addition, AlignedAug outperforms existing LLM-based augmentation methods on classification and response selection tasks (classification accuracy: 0.65 $\rightarrow$ 0.69, response selection: R@5 0.77 $\rightarrow$ 0.84). These findings demonstrate that AlignedAug produces synthetic dialogues that not only align more closely with real dialogues but also improve the performance of models trained on them.
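The abstract's Distribution Alignment stage can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it assumes tokens come pre-labeled with hypothetical tags (`SUBJ`, `VERB`, `NEG`) for Chat Style Refinement, uses a plain two-sample Kolmogorov-Smirnov statistic (consistent with the $S_{KS}$ metric the abstract reports) to measure distribution distance, and performs Statistical Selection greedily over scalar naturalness scores.

```python
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    def ecdf(sample, x):
        # Fraction of values in the sample that are <= x.
        return sum(1 for v in sample if v <= x) / len(sample)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

def chat_style_refine(tagged_tokens, drop_prob=0.3, rng=None):
    """Chat Style Refinement sketch: randomly drop tokens except
    subjects, verbs, and negations (tag names are assumptions)."""
    rng = rng or random.Random(0)
    protected = {"SUBJ", "VERB", "NEG"}
    return [tok for tok, tag in tagged_tokens
            if tag in protected or rng.random() > drop_prob]

def statistical_selection(candidates, real_scores, k):
    """Statistical Selection sketch: greedily pick k synthetic
    dialogues (dialogue, score) whose score distribution minimizes
    the KS distance to the real-score distribution."""
    selected, pool = [], list(candidates)
    for _ in range(k):
        best = min(pool, key=lambda c: ks_statistic(
            [s for _, s in selected] + [c[1]], real_scores))
        selected.append(best)
        pool.remove(best)
    return selected
```

A usage sketch: refine each generated dialogue's utterances with `chat_style_refine`, score them with a naturalness model, then call `statistical_selection(scored_dialogues, real_scores, k)` to keep the subset whose score distribution best matches the real data.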
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: data augmentation, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 8007