PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

Published: 01 Jan 2023, Last Modified: 23 May 2023, CoRR 2023, Readers: Everyone
Abstract: This paper introduces PMIndiaSum, a new multilingual and massively parallel headline summarization corpus focused on languages in India. Our corpus covers four language families, 14 languages, and 196 language pairs, the largest number to date. It provides a testing ground for all cross-lingual pairs. We detail our workflow for constructing the corpus, including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization using fine-tuning, prompting, and translate-and-summarize approaches. Experimental results confirm the crucial role of our data in aiding the summarization of Indian texts. Our dataset is publicly available and can be freely modified and redistributed.
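
The figure of 196 language pairs is consistent with counting every ordered (source, target) combination of the 14 languages, same-language pairs included. That reading is an assumption, since the abstract does not spell out the counting. A minimal sketch of the arithmetic follows, with placeholder language identifiers standing in for the actual language list, which the abstract does not give:

```python
from itertools import product

NUM_LANGUAGES = 14  # number of languages stated in the abstract

# Placeholder identifiers (hypothetical); the real language names
# are not listed in the abstract.
languages = [f"lang{i:02d}" for i in range(NUM_LANGUAGES)]

# Assumption: a "pair" is an ordered (source, target) combination,
# and source == target counts as a monolingual pair.
all_pairs = list(product(languages, repeat=2))
monolingual = [(src, tgt) for src, tgt in all_pairs if src == tgt]
cross_lingual = [(src, tgt) for src, tgt in all_pairs if src != tgt]

print(len(all_pairs), len(monolingual), len(cross_lingual))
```

Under this assumption, the total splits into 14 monolingual pairs and 182 cross-lingual directions.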