CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

ACL ARR 2025 May Submission7864 Authors

20 May 2025 (modified: 29 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges because their unique writing systems differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks such as headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian, curated specifically for headline generation. We also provide a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.

Paper Type: Short

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking; language resources; multilingual corpora; datasets for low resource languages

Contribution Types: Data resources

Languages Studied: Tibetan; Uyghur; Mongolian (Traditional)

Submission Number: 7864
