Abstract: Speech-to-Speech Translation (S2ST) converts speech in one language into speech in another. Although various S2ST models exist, none adequately supports Indic languages, primarily because no suitable dataset is available. We fill this gap by introducing Indic-S2ST, a multilingual and multimodal many-to-many S2ST dataset of approximately 600 hours in 14 Indic languages, including Indian-accented English. To the best of our knowledge, this is the largest S2ST dataset with parallel speech and text in 14 scheduled Indic languages. Because its speech and text are aligned in parallel, the dataset also supports Automatic Speech Recognition (ASR), Text-to-Speech (TTS) synthesis, Speech-to-Text translation (ST), and Machine Translation (MT). It may therefore be useful for training a model like Meta’s SeamlessM4T for Indic languages. We also pretrain Indic-S2UT, a discrete unit-based S2ST model for Indic languages. To showcase the utility of the dataset, we present baseline results on Indic-S2ST using Indic-S2UT. The dataset and code are available at https://anonymous.4open.science/r/Indic-S2ST-2129/README.md.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, language resources, multilingual corpora, NLP datasets, datasets for low resource languages, automatic speech recognition, speech technologies, spoken dialog, spoken language translation, spoken language understanding
Contribution Types: Data resources
Languages Studied: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Oriya, Punjabi, Tamil, Telugu, Urdu
Previous URL: https://openreview.net/forum?id=QQ4vUgwKm7
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 1, 2
B2 Discuss The License For Artifacts: No
B2 Elaboration: Publicly available
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: 2
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 2
B6 Statistics For Data: Yes
B6 Elaboration: 2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 3
C3 Descriptive Statistics: Yes
C3 Elaboration: 2
C4 Parameters For Packages: Yes
C4 Elaboration: 2
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: 2
D2 Recruitment And Payment: No
D2 Elaboration: Students volunteered to validate the data.
D3 Data Consent: Yes
D3 Elaboration: 2
D4 Ethics Review Board Approval: Yes
D4 Elaboration: 2
D5 Characteristics Of Annotators: Yes
D5 Elaboration: 2
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 685