CMTS: A Dataset and Benchmark for Text Summarization of Minority Languages in China

ACL ARR 2025 February Submission2828 Authors

15 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like text summarization. To address this gap, we introduce a novel dataset, Chinese Minority Text Summarization (CMTS), which includes 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian, specifically curated for text summarization tasks. Additionally, we provide a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing text summarization in Chinese minority languages and contribute to the development of related benchmarks.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; language resources; datasets for low-resource languages; NLP datasets; multilingual corpora
Contribution Types: Data resources
Languages Studied: Tibetan; Uyghur; Mongolian (Traditional)
Submission Number: 2828