OccGen: Selection of Real-world Multilingual Parallel Data Balanced in Gender within Occupations

Published: 17 Sept 2022, Last Modified: 23 May 2023
Venue: NeurIPS 2022 Datasets and Benchmarks
Readers: Everyone
Keywords: Balanced Multilingual Data Set, Gender, Occupations, Machine Translation
TL;DR: We present the OccGen toolkit that builds multilingual parallel data sets balanced in gender within occupations. The toolkit is released together with two datasets in four high-resource languages and in a low-resource language (with English).
Abstract: This paper describes the OccGen toolkit, which extracts multilingual parallel data balanced in gender within occupations. OccGen can extract datasets that reflect gender diversity in society (beyond binary) more fairly, and these datasets can then be used to explicitly mitigate occupational gender stereotypes. We propose two use cases that extract evaluation datasets for machine translation: one in four high-resource languages from different linguistic families, and one in a low-resource African language. Our analysis of these use cases shows that translation outputs in high-resource languages tend to be worse on the feminine subsets than on the masculine ones. This can be explained by the model paying less attention to the source sentence and more attention to the target prefix, which leads it to overgeneralize to the most frequent masculine forms.
Supplementary Material: pdf
Contribution Process Agreement: Yes
In Person Attendance: Yes
URL: https://github.com/mt-upc/OccGen_dataset
Dataset Url: https://github.com/mt-upc/OccGen_dataset
License: CC-BY-SA 3.0
Author Statement: Yes