MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

Published: 24 Sept 2025, Last Modified: 30 Nov 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: corpus creation; benchmarking; language resources; multilingual corpora; NLP datasets; evaluation methodologies; evaluation; datasets for low resource languages; metrics
TL;DR: MORPHOGEN is a benchmark and evaluation framework testing multilingual LLMs on gender-aware morphological transformations in French, Arabic, and Hindi, exposing critical model limitations and biases.
Abstract: While multilingual large language models (LLMs) perform well on high-level tasks such as translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a large-scale, morphologically grounded benchmark dataset for evaluating gender-aware generation in three typologically diverse, grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset for these three languages and benchmark 15 popular multilingual LLMs (2B–70B parameters) on their ability to perform this transformation. Our results reveal gaps in, and offer insights into, how current models handle morphological gender. MORPHOGEN offers a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive multilingual LLMs.
Submission Number: 211