Abstract: Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs) and ensuring their effectiveness in real-world applications. Although numerous QA datasets have been developed, along with some parallel efforts, there remains a notable lack of both a framework and large-scale region-specific datasets built from queries posed by native users in their own languages. This gap hinders effective benchmarking and the development of fine-tuned models that capture regional and cultural specificities. In this study, we propose a scalable, language-independent framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by constructing a multilingual natural QA dataset, MultiNativQA, consisting of ~64k manually annotated QA pairs in seven languages, ranging from high- to extremely low-resource, based on queries from native speakers in 9 regions and covering 18 topics. We benchmark open- and closed-source LLMs on the MultiNativQA dataset. We make the NativQA framework, the MultiNativQA dataset, and the experimental scripts publicly available to the community (https://anonymous.com/).
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: resources for less-resourced languages, multilingual benchmarks, multilingual corpora, NLP datasets, datasets for low resource languages
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Arabic, Assamese, Bangla, English, Hindi, Nepali, Turkish
Submission Number: 826