Towards the Development of a LegalNLP Dataset for Bodo and the Evaluation of Abstractive Summarization Models

ACL ARR 2025 May Submission7339 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Natural Language Processing (NLP) has become a transformative tool for analyzing large volumes of unstructured legal text, enabling tasks such as document summarization, judgment prediction, and legal information retrieval. However, most advances in Legal NLP have focused on high-resource languages such as English, leaving low-resource languages such as Bodo significantly underrepresented. To address this gap, this paper presents a legal training and test dataset for Bodo, a language spoken in Northeast India. Legal case documents and their summaries were sourced from publicly available platforms and translated into Bodo using the IndicTrans2 model, then preprocessed and standardized to ensure linguistic consistency. Data quality was assessed using BLEU scores and manual human evaluation. The dataset was also used to evaluate several state-of-the-art abstractive summarization models, including sequence-to-sequence architectures, pretrained transformers, and large language models, with performance measured by ROUGE and chrF scores. The findings emphasize the importance of building language-specific resources and provide a foundational benchmark for advancing Legal NLP research in Bodo and other low-resource languages.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Summarization
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Bodo
Submission Number: 7339
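
The abstract reports summarization quality with ROUGE and chrF. As a rough illustration of what these two metrics measure, the sketch below computes a unigram-overlap F1 (ROUGE-1) and a simplified character n-gram F-score in the spirit of chrF. It is not the authors' evaluation code: the function names, whitespace tokenization, the English example strings, and the settings `max_n=6` and `beta=2.0` are assumptions made here for illustration, and the simplified score will not exactly match library implementations.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (ROUGE-1), assuming whitespace tokenization."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each token counts at most as often as in either text.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def char_ngram_fscore(reference: str, candidate: str,
                      max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score (hypothetical helper, not a library API).

    Averages character n-gram precision and recall over n = 1..max_n,
    skipping orders too long for either string, then combines them
    with an F-beta score. Spaces are removed before counting.
    """
    ref = reference.replace(" ", "")
    cand = candidate.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        cand_ngrams = Counter(cand[i:i + n] for i in range(len(cand) - n + 1))
        if not ref_ngrams or not cand_ngrams:
            continue  # one of the strings is shorter than n characters
        overlap = sum((ref_ngrams & cand_ngrams).values())
        precisions.append(overlap / sum(cand_ngrams.values()))
        recalls.append(overlap / sum(ref_ngrams.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    reference = "the court dismissed the appeal"
    candidate = "the court allowed the appeal"
    print(rouge1_f1(reference, candidate))
    print(char_ngram_fscore(reference, candidate))
```

Character-level scores such as chrF are often preferred alongside word-level ROUGE for morphologically rich or low-resource languages, because partial word matches still earn credit.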