mNLQuAD: Multilingual Non-Factoid Long-Context Question Answering

Anonymous

16 Dec 2023 | ACL ARR 2023 December Blind Submission | Readers: Everyone
Abstract: Most existing Question Answering Datasets (QuADs) focus on factoid-based, short-context Question Answering (QA) in high-resource languages. For low-resource languages, coverage remains limited: only a few works address factoid-based short-context QuADs, and none address non-factoid QuADs with either short or long contexts. This work therefore presents mNLQuAD, a multilingual QuAD of non-factoid questions with long contexts. It uses interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers. The dataset comprises over 370K QA pairs across 42 languages, including several low-resource languages, and stands as the largest multilingual QA dataset to date. In a manual annotation of 790 QA pairs from mNLQuAD (the golden set), 98% of the annotated questions were answered using their corresponding silver answer. Our fine-tuned Answer Paragraph Selection (APS) model outperforms the baselines, attaining an accuracy of 80% and 72% and a macro F1 of 72% and 66% on the mNLQuAD test set and the golden set, respectively. Furthermore, the APS model generalizes effectively to certain languages within the golden set, even after being fine-tuned on silver labels.
Paper Type: long
Research Area: Question Answering
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Oromo, Amharic, French, Hausa, Igbo, Gahuza, Pidgin, Somali, Swahili, Tigrinya, Yoruba, Kyrgyz, Uzbek, Burmese, Chinese, Indonesian, Korean, Thai, Vietnamese, Bengali, Gujarati, Hindi, Marathi, Nepali, Pashto, Punjabi, Sinhala, Tamil, Telugu, Urdu, Azeri, Naidheachdan, Russian, Serbian, Turkce, Ukrainian, Cymrufyw, English, Portuguese, Mundo, Arabic, Persian
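
The abstract's construction rule (interrogative sub-headings become questions, the paragraphs under them become silver answers) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The `(kind, text)` block representation, the `QAPair` and `extract_silver_pairs` names, and the heuristic of detecting interrogatives by a trailing question mark are all assumptions made for illustration; the abstract does not say how interrogative sub-headings are identified, and a trailing-mark check would likely miss questions in languages or styles that omit the mark.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Assumption: a sub-heading counts as interrogative if it ends with one of
# these marks (ASCII, full-width, Arabic). The paper may use another rule.
QUESTION_MARKS = ("?", "？", "؟")


@dataclass
class QAPair:
    question: str        # interrogative sub-heading
    silver_answer: str   # paragraphs that follow it, up to the next sub-heading


def extract_silver_pairs(blocks: Iterable[Tuple[str, str]]) -> List[QAPair]:
    """Pair each interrogative sub-heading with the paragraphs beneath it.

    `blocks` is a hypothetical article representation: an ordered sequence of
    ("subheading" | "paragraph", text) tuples.
    """
    pairs: List[QAPair] = []
    question = None
    paragraphs: List[str] = []
    # The empty sentinel sub-heading flushes the final section.
    for kind, text in list(blocks) + [("subheading", "")]:
        text = text.strip()
        if kind == "subheading":
            if question is not None and paragraphs:
                pairs.append(QAPair(question, "\n".join(paragraphs)))
            # Non-interrogative sub-headings start a section that is skipped.
            question = text if text.endswith(QUESTION_MARKS) else None
            paragraphs = []
        elif kind == "paragraph" and question is not None:
            paragraphs.append(text)
    return pairs
```

Under these assumptions, paragraphs before the first sub-heading and paragraphs under non-interrogative sub-headings produce no pairs, so only sections headed by a question contribute to the silver set.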