Advancing NLP Equity: A Secondary Benchmark Evaluation of Multilingual Language Models for Underrepresented Languages
Keywords: Multilingual PLMs, Zero-shot NLI, Low-resource languages
TL;DR: We audit an XNLI-tuned XLM-R model on AmericasNLI and XNLI, finding an accuracy gap of roughly 56 points between high- and low-resource languages. A Quechua case study shows how multilingual models fail low-resource communities, highlighting the need for simple fairness audits.
Abstract: Recent multilingual language models promise support for “100+ languages,” yet speakers of Indigenous and other underrepresented languages often still do not see themselves in these advances. In this work, we take a deliberately simple, secondary-benchmark perspective: rather than proposing a new model or dataset, we re-evaluate an off-the-shelf multilingual natural language inference (NLI) model on public benchmarks that explicitly include Indigenous languages of the Americas. Concretely, we use the AmericasNLI benchmark for ten Indigenous languages and XNLI for English and Spanish, and we evaluate the widely used joeddav/xlm-roberta-large-xnli model under a fixed, zero-shot protocol. Our goal is to answer three questions: (i) How large is the performance gap between high-resource and underrepresented languages under the same model and task? (ii) Are these gaps consistent across languages, or do some communities fare systematically worse than others? (iii) What kinds of qualitative errors arise, and what do they suggest about cultural and linguistic mismatch? Our experiments reveal a striking discrepancy: while English and Spanish reach near-perfect accuracy on XNLI (around 99.8% in our runs), the same model averages only about 43% accuracy across the ten Indigenous languages in AmericasNLI, with none exceeding 47%. We also present qualitative NLI failures in Quechua that point to difficulties with morphology, idioms, and discourse-level inference. We argue that even such a simple re-analysis can serve as a low-cost yet high-impact tool for making inequities in multilingual NLP visible, especially for communities that rarely appear in headline benchmarks.
Submission Number: 20