Abstract: Maithili is one of the 22 official languages recognized in the Indian Constitution. Its literature is rich; however, owing to current socio-political changes, the language is on the verge of extinction. It is therefore crucial to develop corpora for low-resource Indic languages like Maithili so that the goal of "No Language Left Behind" (NLLB) is realized. With this in mind, we contribute a corpus of 1,05,600 sentences, both manually curated and synthetically generated. Additionally, we establish a strong baseline on the Maithili-Hindi pair using our data, surpassing the baseline achievable with existing NLLB data.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, language resources, NLP datasets, automatic evaluation of datasets, evaluation methodologies, evaluation, datasets for low resource languages, metrics, reproducibility
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: Maithili, Hindi
Previous URL: https://openreview.net/forum?id=Z5kg21e2Dt
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section Number 6
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: section 1, footnote
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 6
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Appendix A.6
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: Section 6
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 2, Table 1
B6 Statistics For Data: Yes
B6 Elaboration: subsection 2.4, Appendix A.2, Tables 5 and 6
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix A.3
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix A.3
C3 Descriptive Statistics: Yes
C3 Elaboration: subsection 3.4, Tables 3 and 4
C4 Parameters For Packages: Yes
C4 Elaboration: subsection 3.3, Appendix A.4
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: subsection 2.4
D2 Recruitment And Payment: N/A
D3 Data Consent: Yes
D3 Elaboration: section 2, Appendix A.1
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: Yes
D5 Elaboration: subsection 2.4, first paragraph
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 293