Data Augmentation for Less Resourced Summarization

ACL ARR 2024 April Submission491 Authors

16 Apr 2024 (modified: 15 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: To improve automatic text summarization for less-resourced languages, we fine-tune multilingual pre-trained models in each language with additional data beyond the human-written summaries. We investigate three data augmentation strategies that use unlabeled Wikipedia articles as synthetic training data. We find that adding comparatively small amounts of extra data improves ROUGE scores, and that models trained on extractive target summaries maintain higher novelty than models trained on non-extractive targets. The augmentation strategies improve ROUGE scores for every language studied, but the best-performing strategy differs across languages.
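The abstract does not spell out the three augmentation strategies, but one common way to turn unlabeled articles into synthetic training pairs with extractive targets is a lead-sentence heuristic: pair each article with its first few sentences as a pseudo-summary. The sketch below is purely illustrative of that general idea (the function names, the naive sentence splitter, and the lead-k choice are all assumptions, not the paper's method):

```python
def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on sentence-final punctuation (illustration only)."""
    parts = [s.strip() for s in text.replace("!", ".").replace("?", ".").split(".")]
    return [s for s in parts if s]

def make_extractive_pair(article: str, k: int = 3) -> tuple[str, str]:
    """Pair the full article with its first k sentences as an extractive pseudo-summary.

    This is a hypothetical lead-k heuristic, not necessarily one of the
    paper's three strategies.
    """
    sents = split_sentences(article)
    target = ". ".join(sents[:k]) + "."
    return article, target

# Toy usage with a made-up article:
article = ("Wikipedia hosts articles in many languages. "
           "Unlabeled articles can serve as extra training text. "
           "A lead-sentence heuristic yields extractive targets. "
           "The rest of the article becomes the source document.")
src, tgt = make_extractive_pair(article, k=2)
```

Synthetic pairs built this way can be mixed with the (scarce) human-written summaries during fine-tuning; because the targets are copied verbatim from the source, they bias the model toward extractive behavior, which may relate to the novelty effect the abstract reports.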
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Less-resourced languages, low-resourced languages, summarization, data augmentation, synthetic data, low resource, less resourced
Contribution Types: Approaches to low-resource settings
Languages Studied: Sorani Kurdish, Kurmanji Kurdish, Haitian Creole, Georgian, Armenian, Khmer, Macedonian
Submission Number: 491