Chinese Internet Dialogue Corpus: A High-Quality Dataset from Social Media

ACL ARR 2025 May Submission4138 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Recently, large-scale dialogue datasets have attracted increasing attention in the research community. Many previous studies have constructed dialogue datasets by gathering massive amounts of raw data from social media platforms and converting them into dialogues using rule-based methods. However, the usability of such datasets for training depends heavily on the quality of the raw data. Unfortunately, most raw data from major social media platforms are unstructured and noisy, making it difficult to generate clean dialogue datasets using rule-based approaches alone. To address this issue, we propose a novel transfer method that combines model-based and rule-based techniques to process raw data collected from social media platforms. In addition, we introduce a novel scoring method for evaluating the quality of dialogue datasets. Our experiments find a correlation between our scoring method and human judgments of dialogue quality. Using this method, we further evaluate our proposed dataset and compare it with existing dialogue datasets. As a result, we present the Chinese Internet Dialogue Corpus, which contains 3,102,235 short-text dialogues sourced from Baidu Tieba, a popular Chinese social media platform. The Chinese Internet Dialogue Corpus, including both the code and the dataset, will be made publicly available soon at https://github.com/anonymous20250123/emnlp2025.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: data resources, data analysis, evaluation and metrics, dialogue modeling, social media, Chinese language
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 4138