Abstract: The rapid evolution of Natural Language Processing (NLP) has favoured major languages
such as English, leaving a significant gap for
many others due to limited resources. This
is especially evident in the context of data annotation, a task whose importance cannot be
overstated, but which is time-consuming
and costly. Thus, any dataset for resource-poor
languages is precious, in particular when it is
task-specific. Here, we explore the feasibility
of repurposing an existing multilingual dataset
for a new NLP task: we adapt a subset
of the BELEBELE dataset (Bandarkar et al.,
2023), which was designed for multiple-choice
question answering (MCQA), to enable the
more practical task of extractive QA (EQA)
in the style of machine reading comprehension. We present annotation guidelines and
a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present
QA evaluation results for several monolingual
and cross-lingual QA pairs including English,
MSA, and five Arabic dialects. We aim to
help others adapt our approach for the remaining 120 BELEBELE language variants, many of
which are deemed under-resourced. We also
provide a thorough analysis and share insights
to deepen understanding of the challenges and
opportunities in NLP task reformulation.