Abstract: India, a country with a large population, has two official and twenty-two scheduled languages, making it one of the most linguistically diverse nations in the world. Despite being a scheduled language, Santali remains a low-resource language. Although Ol Chiki is recognized as the official script for Santali, many speakers continue to use the Bengali, Devanagari, Odia, and Roman scripts. In tribute to the upcoming centennial of the Ol Chiki script, we present an Automatic Speech Recognition (ASR) system for Santali in the Ol Chiki script. Our approach applies cross-lingual transfer learning: we fine-tune Whisper models pre-trained on Bengali and Hindi using Santali speech paired with Ol Chiki transcriptions. Fine-tuning the Bengali pre-trained model achieves a Word Error Rate (WER) of 28.47%, while the Hindi pre-trained model reaches 34.50% WER. Both results are obtained with the Whisper Small model.
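As a rough illustration of the inference and evaluation side of the approach summarized above (not the authors' exact pipeline), the sketch below loads a Whisper Small checkpoint already adapted to Bengali, transcribes a Santali utterance, and scores hypotheses against Ol Chiki reference transcriptions with WER using the Hugging Face transformers and evaluate libraries; the checkpoint path and helper names are placeholders, not released artifacts from the paper.

import torch
import evaluate
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Placeholder: a Whisper Small checkpoint fine-tuned on Bengali (swap in a Hindi
# checkpoint to mirror the second transfer setting described in the abstract).
BENGALI_CHECKPOINT = "path/to/whisper-small-bengali"

processor = WhisperProcessor.from_pretrained(BENGALI_CHECKPOINT)
model = WhisperForConditionalGeneration.from_pretrained(BENGALI_CHECKPOINT)

def transcribe(waveform, sampling_rate=16000):
    # Greedy decoding of a single 16 kHz mono Santali waveform.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# Word Error Rate over whitespace-tokenized Ol Chiki transcriptions, reported in %.
wer_metric = evaluate.load("wer")

def word_error_rate(references, hypotheses):
    return 100.0 * wer_metric.compute(references=references, predictions=hypotheses)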
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Santali, Ol Chiki, Indian language, Low-Resource, Cross-Lingual
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Santali
Previous URL: https://openreview.net/forum?id=QggVnKypmw
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 1, Section 4.1
B2 Discuss The License For Artifacts: No
B2 Elaboration: We did not explicitly discuss licenses in the paper. However, we used publicly available resources such as Whisper (MIT License), Mozilla Common Voice (CC0), and IndicVoices (open-access for research).
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We did not explicitly discuss artifact usage consistency in the paper. However, all artifacts—including Whisper, Common Voice, and IndicVoices—were used strictly for academic research, consistent with their intended use and licensing. No derived data or models were used or distributed beyond the research context.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 1
B6 Statistics For Data: Yes
B6 Elaboration: Section 3
C Computational Experiments: Yes
C1 Model Size And Budget: N/A
C2 Experimental Setup And Hyperparameters: N/A
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 1181