The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses
Abstract: Gender bias in natural language processing
(NLP) applications, particularly machine translation, has been receiving increasing attention.
Much of the research on this issue has focused
on mitigating gender bias in English NLP models and systems. Addressing the problem in
poorly resourced and/or morphologically rich
languages has lagged behind, largely due to
the lack of datasets and resources. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving
one or two target users (I and/or You) – first
and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. The corpus has multiple parallel components: four combinations of
1st and 2nd person in feminine and masculine
grammatical genders, as well as English, and
English to Arabic machine translation output.
This corpus expands on the Arabic Parallel Gender Corpus (APGC v1.0)
of Habash et al. (2019) by adding second person targets and by
increasing the total number of sentences more than 6.5-fold,
reaching over 590K words. Our
new dataset will aid the research and development of gender identification, controlled text
generation, and post-editing rewrite systems
that could be used to personalize NLP applications and provide users with the correct outputs based on their grammatical gender preferences. We make the Arabic Parallel Gender
Corpus (APGC v2.0) publicly available.