Tonative: Community-Driven Extension of African Datasets Through Human-AI Collaboration
Keywords: Machine translation, Low-resource, African datasets
Abstract: The creation of language resources for African languages faces significant sustainability challenges. As a result, thousands of languages on the continent remain severely low-resource. Although community-led efforts have been impactful, they are expensive and difficult to scale. Synthetic data from large language models may be more scalable, but it risks introducing ‘translationese’ and amplifying existing biases. This paper presents the Tonative project, a human-AI collaborative framework designed to continuously extend existing African language datasets. Our pipeline combines automated translation with community-based human validation, which reduces the manual review workload while preserving authenticity. We applied this approach to extend existing datasets, improving their linguistic coverage and representation. This work provides a foundation for more sustainable contributions to African NLP by leveraging existing resources and collaborating with native speakers.
Submission Number: 50