Massively Cross-Language Understanding

Anonymous

16 Oct 2023 | ACL ARR 2023 October Blind Submission | Readers: Everyone

Abstract: Natural Language Inference (NLI) is one of the standard tasks researchers use to benchmark the language understanding capability of language models. Traditionally, NLI has the premise and hypothesis in the same language, with existing datasets covering 15 languages in a monolingual setting. A cross-language variation, where the premise and hypothesis are in different languages, is a mostly unexplored task that tests a model's ability to understand and relate text in different languages at once. In this work, we 1) create a cross-language entailment dataset built on existing entailment datasets and expand it to 93 languages, 2) test and provide baselines for the cross-language reasoning capability of large masked language models, and 3) investigate the cross-lingual transfer ability of our dataset. Overall, we find that models perform worse in a cross-language setting than they do monolingually, with performance degrading as the number of languages increases. Finally, we show that using our dataset achieves greater cross-lingual transfer than using monolingual data. This work sheds light on the challenges and opportunities for enhancing the cross-language reasoning abilities of language models and invites further exploration of this task.

Paper Type: long

Research Area: Resources and Evaluation

Contribution Types: Approaches to low-resource settings, Data resources, Data analysis

Languages Studied: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish

Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.
