Cross-Lingual Transfer Learning for Santali Speech Recognition

ACL ARR 2025 May Submission 4264 Authors

19 May 2025 (modified: 29 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: India, a country with a large population, has two official languages and twenty-two scheduled languages, making it one of the most linguistically diverse nations. Although Santali is one of the scheduled languages, it remains a low-resource language. Ol Chiki is recognized as the official script for Santali, yet many people continue to write it in the Bengali, Devanagari, Odia, and Roman scripts. In tribute to the upcoming centennial of the Ol Chiki script, we present an Automatic Speech Recognition (ASR) system for Santali that transcribes into the Ol Chiki script. Our approach is cross-lingual transfer learning: we adapt Whisper models pre-trained on Bengali and on Hindi to Santali, using Ol Chiki transcriptions. Starting from the Bengali pre-trained model, we achieve a Word Error Rate (WER) of 23.59%, whereas starting from the Hindi pre-trained model yields a WER of 28.75%. Both results were obtained with Whisper Small.
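
For readers unfamiliar with the metric reported above, WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the abstract does not say how Ol Chiki text was normalized before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()   # assumption: words are whitespace-separated
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not Santali data: one substitution and one deletion
# against a four-word reference.
example = word_error_rate("w1 w2 w3 w4", "w1 wX w3")
```

Under this definition, a WER of 23.59% corresponds to roughly 24 word-level edits per 100 reference words. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.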

Paper Type: Short

Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding

Research Area Keywords: Speech Recognition, Santali, Ol Chiki, Indian language, Low-Resource, Cross-Lingual

Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models

Languages Studied: Santali

Submission Number: 4264
