Advancing NLP Equity: A Secondary Benchmark Evaluation of Multilingual Language Models for Underrepresented Languages
Keywords: Multilingual PLMs, Zero-shot NLI, Low-resource languages
TL;DR: We audit an XNLI-tuned XLM-R model on AmericasNLI and XNLI, finding an accuracy gap of roughly 56 points between high- and low-resource languages. A Quechua case study shows how multilingual models fail low-resource communities, highlighting the need for simple fairness audits.
Abstract: Recent multilingual language models promise support for “100+ languages,” yet speakers of Indigenous and other underrepresented languages often still do not see themselves in these advances. In this work, we take a deliberately simple, secondary-benchmark perspective: rather than proposing a new model or dataset, we re-evaluate an off-the-shelf multilingual natural language inference (NLI) model on public benchmarks that explicitly include Indigenous languages of the Americas. Concretely, we use the AmericasNLI benchmark for ten Indigenous languages and XNLI for English and Spanish, and we evaluate the widely used joeddav/xlm-roberta-large-xnli model under a fixed, zero-shot protocol. Our goal is to answer three questions: (i) How large is the performance gap between high-resource and underrepresented languages under the same model and task? (ii) Are these gaps consistent across languages, or do some communities fare systematically worse than others? (iii) What kinds of qualitative errors arise, and what do they suggest about cultural and linguistic mismatch? Our experiments reveal a striking discrepancy: while English and Spanish reach near-perfect accuracy on XNLI (around 99.8% in our runs), the same model averages only about 43% accuracy across the ten Indigenous languages in AmericasNLI, with none exceeding 47%. We also present qualitative NLI failures in Quechua that point to difficulties with morphology, idioms, and discourse-level inference. We argue that even such a simple re-analysis can serve as a low-cost yet high-impact tool for making inequities in multilingual NLP visible, especially for communities that rarely appear in headline benchmarks.
Submission Number: 20