Leveraging High-Fidelity LLMs for Synthetic Data Generation: A Scalable Pivot Strategy

Kshetrimayum Boynao Singh, Deepak Kumar, Ramakrishna Appicharla, Asif Ekbal, Partha Pakray

Published: 06 Jan 2026, Last Modified: 17 Mar 2026 | Crossref | CC BY-SA 4.0

Abstract: This research demonstrates that Large Language Models (LLMs) can generate synthetic parallel data for Indic-to-Indic language pairs, thereby alleviating resource constraints in critical legal contexts. The study evaluated three recent LLMs (Gemini-2.5-Pro, Claude-4, and LLaMA-4-Maverick) on the translation of legal documents from English to Hindi, using a specialised dataset of 5000 bitexts. The evaluation followed a multi-metric approach that combined six automatic evaluation metrics with assessments by trained human evaluators. Gemini-2.5-Pro outperformed both competitors in automated semantic analysis and in human-rated fidelity, even without any fine-tuning. The approach offers a data-driven way to strengthen India's machine translation infrastructure and to extend it to more languages. The research underscores the capacity of LLMs to improve accessibility and communication across Indic-to-Indic language pairs. The dataset and related resources are publicly available.

External IDs: doi:10.36227/techrxiv.176774200.01023088/v1