Abstract: India, a populous country, has two official languages and twenty-two scheduled languages, making it one of the most linguistically diverse nations. Although Santali is one of the scheduled languages, it remains a low-resource language. Ol Chiki is recognized as the official script for Santali, yet many speakers continue to write it in the Bengali, Devanagari, Odia, and Roman scripts. In tribute to the upcoming centennial of the Ol Chiki script, we present an Automatic Speech Recognition (ASR) system for Santali that transcribes into the Ol Chiki script. Our approach uses cross-lingual transfer learning: we adapt Whisper models pre-trained on Bengali and on Hindi to Santali, using Ol Chiki script transcriptions. Adapting the Bengali pre-trained model yields a Word Error Rate (WER) of 23.59%, whereas adapting the Hindi pre-trained model yields a WER of 28.75%. Both results were obtained with the Whisper Small model.
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Santali, Ol Chiki, Indian language, Low-Resource, Cross-Lingual
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Santali
Submission Number: 4264
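The abstract reports results as Word Error Rate (WER). As a minimal sketch of how that metric is commonly defined (word-level edit distance divided by the number of reference words), the following may help in reading the reported figures. The function name `word_error_rate`, the whitespace tokenization, and the placeholder sentences are assumptions for illustration; the evaluation procedure actually used in the paper is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace, with no further
    text normalization applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution and one deletion over four reference words.
score = word_error_rate("w1 w2 w3 w4", "w1 wX w3")
print(f"WER = {score:.2%}")
```

Under this definition, a WER of 23.59% corresponds to roughly 24 word-level errors (substitutions, insertions, or deletions) per 100 reference words. Corpus-level scores are usually computed by summing errors over all utterances before dividing, rather than by averaging per-utterance scores.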