Human + AI: Large-Scale Data Curation for Multilingual Guardrails

Published: 08 Jun 2025, Last Modified: 12 Jun 2025, Venue: DaSH, License: CC BY-NC-ND 4.0
Keywords: Large Language Models, Multilingual Annotation, LLM-as-a-Judge, Synthetic PII Generation, Human-in-the-Loop, Prompt Authoring
TL;DR: We present an AI-assisted framework that accelerates multilingual prompt authoring with synthetic PII and LLM-based validation, reducing annotation time by over 40% for underrepresented languages.
Abstract: As Large Language Models (LLMs) become increasingly central to real-world applications, the demand for high-quality, instruction-compliant, multilingual training data has surged, particularly for tier-2 languages with limited digital representation. In this work, we introduce an AI-assisted annotation framework designed to streamline the authoring of training data for multilingual guardrails, specifically PII detection, for Supervised Fine-Tuning (SFT) of LLMs. Targeting 13 locales, most of them underrepresented, we operationalize a suite of AI tools to augment human annotators without replacing them. Our results demonstrate a reduction of more than 40% in average handling time alongside improvements in instruction compliance, semantic diversity, and data quality. A key contribution of this work is our exploration of the emerging 'LLM-as-a-Judge' paradigm, using LLMs not only as generative tools but also as evaluators of human-authored training data.
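To make the LLM-as-a-Judge idea concrete, the sketch below shows one way a judge model could score an annotator-authored prompt containing synthetic PII placeholders against a simple rubric. This is an illustration only, not the authors' implementation: the rubric fields, scoring scale, model name, and placeholder format are all assumptions, and the call assumes an OpenAI-compatible chat API.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint; any chat LLM client could be substituted

client = OpenAI()

# Hypothetical rubric; the paper's actual evaluation criteria are not specified on this page.
JUDGE_SYSTEM_PROMPT = """You are a strict data-quality judge for multilingual SFT prompts.
Score the candidate prompt from 1-5 on each criterion and reply with JSON only:
{"instruction_compliance": int, "pii_placeholder_usage": int, "semantic_diversity": int, "verdict": "accept" or "revise"}"""

def judge_prompt(candidate_prompt: str, locale: str, model: str = "gpt-4o") -> dict:
    """Ask an LLM to evaluate a human-authored training prompt (illustrative sketch)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Locale: {locale}\nCandidate prompt:\n{candidate_prompt}"},
        ],
        temperature=0,  # deterministic scoring for reproducible validation
    )
    raw = response.choices[0].message.content
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # If the judge's output is not valid JSON, flag the item for human review instead of auto-rejecting it.
        return {"verdict": "revise", "raw_output": raw}

if __name__ == "__main__":
    # Synthetic PII placeholders keep real personal data out of the training set.
    example = "Schreiben Sie eine E-Mail an {FULL_NAME} unter {EMAIL_ADDRESS} über die Lieferung nach {STREET_ADDRESS}."
    print(judge_prompt(example, locale="de-DE"))
```

In a human-in-the-loop setup like the one the abstract describes, prompts flagged "revise" would presumably be routed back to annotators rather than discarded, keeping the judge as an assistant to, not a replacement for, human authors.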
Copyright Form: pdf
Camera Ready: pdf
Submission Number: 5