Abstract: Natural Language Inference (NLI) is a standard task for benchmarking the language understanding capability of language models. Traditionally, NLI presents the premise and hypothesis in the same language, and existing datasets cover 15 languages in this monolingual setting. A cross-language variation, in which the premise and hypothesis are in different languages, is a mostly unexplored task that tests a model's ability to understand and relate text from different languages at once. In this work, we 1) create a cross-language entailment dataset built on existing entailment datasets and expand it to 93 languages, 2) test and provide baselines for the cross-language reasoning capability of large masked language models, and 3) investigate the cross-lingual transfer ability of our dataset. Overall, we find that models perform worse in a cross-language setting than in a monolingual one, with performance degrading as the number of languages increases. Finally, we show that using our dataset achieves greater cross-lingual transfer than using monolingual data. This work sheds light on the challenges and opportunities for enhancing the cross-language reasoning abilities of language models and invites further exploration of this task.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.