EIIR: A Community-Authored Benchmark for Endangered Indic Indigenous Languages

ACL ARR 2025 July Submission 1342 Authors

29 Jul 2025 (modified: 24 Aug 2025), ACL ARR 2025 July Submission, CC BY 4.0

Abstract: We present a culturally-grounded multimodal benchmark of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. Our benchmark -- Endangered Indic Indigenous Recipes (EIIR) -- captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite their capabilities, these models struggle with low-resource, culturally-specific language. However, we observe that providing targeted context -- including background information about the languages, translation examples, and guidelines for cultural preservation -- leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the EIIR benchmark to the NLP community, hoping it motivates the development of language technologies for endangered languages.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: corpus creation, benchmarking, datasets for low resource languages

Contribution Types: Data resources

Languages Studied: Ho, Khortha, Sadri, Santhali, Mundari, Assamese, Meitei, Khasi, Bodo, Kaman Mishmi

Reassignment Request Area Chair: This is not a resubmission

Reassignment Request Reviewers: This is not a resubmission

A1 Limitations Section: This paper has a limitations section.

A2 Potential Risks: N/A

A2 Elaboration: We do not foresee misuse of this dataset

B Use Or Create Scientific Artifacts: Yes

B1 Cite Creators Of Artifacts: N/A

B2 Discuss The License For Artifacts: Yes

B2 Elaboration: 3

B3 Artifact Use Consistent With Intended Use: No

B3 Elaboration: We do not use artifacts created by others

B4 Data Contains Personally Identifying Info Or Offensive Content: Yes

B4 Elaboration: 3

B5 Documentation Of Artifacts: Yes

B5 Elaboration: 3

B6 Statistics For Data: Yes

B6 Elaboration: 3

C Computational Experiments: Yes

C1 Model Size And Budget: No

C1 Elaboration: The LLMs used in our experiments were called through APIs. We have cited the LLM technical reports where appropriate; those reports should contain the relevant details about the models we used.

C2 Experimental Setup And Hyperparameters: N/A

C3 Descriptive Statistics: Yes

C3 Elaboration: 5,6

C4 Parameters For Packages: N/A

D Human Subjects Including Annotators: Yes

D1 Instructions Given To Participants: Yes

D1 Elaboration: Appendix C

D2 Recruitment And Payment: Yes

D2 Elaboration: 3.4

D3 Data Consent: Yes

D3 Elaboration: 3,4

D4 Ethics Review Board Approval: Yes

D4 Elaboration: 3

D5 Characteristics Of Annotators: Yes

D5 Elaboration: 3

E Ai Assistants In Research Or Writing: No

E1 Information About Use Of Ai Assistants: N/A

Author Submission Checklist: yes

Submission Number: 1342
