Segmentation Strategies and Data Enrichment for Improved Abstractive Summarization of Burmese Language

Hlaing Myat Nwe, Ye Kyaw Thu, Thanaruk Theeramunkong, Kiyoaki Shirai, Thepchai Supnithi

Published: 2024, Last Modified: 28 May 2026, PRICAI (2) 2024, CC BY-SA 4.0

Abstract: This study addresses abstractive text summarization for Burmese, a low-resource language, through a combined approach of data collection, model fine-tuning, and segmentation strategies. We fine-tuned the mT5 model for Burmese text summarization on a dataset manually collected from the BBC website. We also examine the effect of segmentation schemes: subword segmentation with SentencePiece outperformed syllable, word, and byte-pair encoding (BPE) segmentation. The findings underscore the role of dataset quality and quantity in improving summarization outcomes. By improving abstractive summarization for Burmese, this work contributes to Natural Language Processing (NLP) research and makes Burmese text data more accessible and usable.
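
As an illustration of one of the segmentation schemes named in the abstract, the sketch below shows how byte-pair encoding (BPE) learns merges and applies them to a word. It is a toy, standard-library version written for this note, not the authors' pipeline: the function names (learn_bpe, merge_pair, segment), the Latin-script sample words, and the choice of characters as starting symbols are all assumptions made for readability.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters (an assumption of this
    # sketch; other starting units are possible).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere in the vocabulary.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only to make the merges easy to follow.
merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(segment("lowest", merges))
```

With two merges, the frequent sequence "low" becomes a single subword while the rarer suffix stays split into characters. Production tools such as SentencePiece add normalization, whitespace handling, and vocabulary-size control that this sketch omits.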

External IDs: dblp:conf/pricai/NweTTSS24