TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation

ACL ARR 2025 July Submission469 Authors

28 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu-English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance, owing to its extensive Telugu-specific training data, fine-tuned SeamlessM4T models remain remarkably competitive despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can match cascaded approaches in low-resource settings. In addition, our metric reliability study, which evaluates BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments, reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu--English translation. The work delivers three key contributions: a reproducible Telugu--English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Spoken Language Translation, Telugu-English Dataset, automatic speech recognition, Machine Translation
Contribution Types: Model analysis & interpretability, Data resources, Surveys
Languages Studied: Telugu, English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: In Section 2
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: In Section 2
B6 Statistics For Data: Yes
B6 Elaboration: In Section 2
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: The models used are described in Section 3; as these are commonly used models, we did not report size and budget details.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: In Section 3
C3 Descriptive Statistics: Yes
C3 Elaboration: In Section 6
C4 Parameters For Packages: No
C4 Elaboration: The evaluation metrics used are listed in Section 6 and their formulas are given in Section 4; we used the standard default configurations, so package parameters are not detailed.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: No
D1 Elaboration: The task was a verification process; the verification protocol is described in Section 2 and poses no risks to participants.
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D3 Elaboration: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: N/A
E1 Elaboration: AI assistants were used only for paraphrasing; they made no significant contribution to the research.
Author Submission Checklist: yes
Submission Number: 469