Keywords: Geographical variation, evaluation, NLI, Basque, Spanish
TL;DR: XNLIvar, a manually curated dataset of Basque and Spanish language variants, reveals a performance drop in both encoder-only and decoder-based Large Language Models, highlighting the need for variation-inclusive NLP resources.
Abstract: In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use NLI as a pivot task and introduce a novel, manually curated parallel dataset in Basque and Spanish and their corresponding variants. Empirical analysis of comprehensive crosslingual and in-context learning experiments with encoder-only and decoder-based Large Language Models (LLMs), respectively, reveals a performance drop when processing linguistic variation, with more pronounced effects observed in Basque. Error analysis indicates that lexical overlap plays no role, suggesting that linguistic variation is the primary reason for the lower results. All data and code are publicly available under the Attribution-NonCommercial 4.0 International license.
Supplementary Material: pdf
Submission Number: 181