Synthetic Data Generation Pipeline for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation

Published: 27 Jan 2026, Last Modified: 17 Feb 2026AfricaNLP 2026EveryoneRevisionsBibTeXCC BY 4.0
Abstract: Despite serving over 100 million speakers as a vital African lingua franca, Swahili remains critically under-resourced for Natural Language Processing, hindering technological progress across East Africa. We present a scalable solution: a controllable synthetic data generation pipeline that produces culturally grounded Swahili text for sentiment analysis, validated through automated LLM judges. To ensure reliability, we conduct targeted human evaluation with a native Swahili speaker on a stratified sample, achieving 80.95% agreement between generated sentiment labels and human ground truth, with strong agreement on judge quality assessments. This demonstrates that LLM-based generation and quality assessment can transfer effectively to low-resource languages. We release a dataset and provide a reproducible pipeline in tandem, providing ample knowledge and working material for NLP researchers in low-resource contexts. We release the resulting Swahili sentiment dataset and the full reproducible generation pipeline publicly at https://huggingface.co/datasets/tabularisai/swahili-sentiment-dataset and https://github.com/tabularis-ai/Synthetic-Data-Generation-Pipeline-for-Low-Resource-Swahili-Sentiment-Analysis.
Submission Number: 23
Loading