TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation

ACL ARR 2025 July Submission469 Authors

28 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu-English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance, owing to its extensive Telugu-specific training data, fine-tuned SeamlessM4T models remain remarkably competitive despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can match cascaded approaches in low-resource settings. In addition, our metric reliability study, which evaluates BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments, reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu--English translation. The work delivers three key contributions: a reproducible Telugu--English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Spoken Language Translation, Telugu-English Dataset, automatic speech recognition, Machine Translation
Contribution Types: Model analysis & interpretability, Data resources, Surveys
Languages Studied: Telugu, English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: In Section 2
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: In Section 2
B6 Statistics For Data: Yes
B6 Elaboration: In Section 2
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: The models used are described in Section 3; as these are commonly used models, we did not report size and budget details.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: In Section 3
C3 Descriptive Statistics: Yes
C3 Elaboration: In Section 6
C4 Parameters For Packages: No
C4 Elaboration: The evaluation metrics used are listed in Section 6 and their formulas are given in Section 4; we used the standard default configurations, so package parameters are not detailed.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: No
D1 Elaboration: The task was a verification process; the verification protocol is described in Section 2 and poses no risks to participants.
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D3 Elaboration: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: N/A
E1 Elaboration: AI assistants were used only for paraphrasing; they made no significant contribution to the research.
Author Submission Checklist: yes
Submission Number: 469