Abstract: Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges because their writing systems differ from international standards. This discrepancy has led to a severe lack of corpora, particularly for supervised tasks such as text summarization. To address this gap, we introduce Chinese Minority Text Summarization (CMTS), a dataset curated for text summarization that contains 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian. We also provide a high-quality test set annotated by native speakers, intended as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing text summarization in Chinese minority languages and will contribute to the development of related benchmarks.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; language resources; datasets for low resource languages; NLP datasets; multilingual corpora
Contribution Types: Data resources
Languages Studied: Tibetan; Uyghur; Mongolian (Traditional)
Submission Number: 2828