Keywords: Machine Translation, Multilingual NLP, AI safety, Foundation Models, Data-Centric Evaluation
Abstract: Large language models (LLMs) perform best in high-resource languages, motivating Machine Translation (MT) as a preprocessing step for multilingual inference. However, translation may alter task-relevant linguistic cues, degrading downstream models. It remains unclear whether such degradation is arbitrary or systematic across languages. We quantify translation-induced downstream drift using round-trip translation (English to pivot language to English) across eight pivot languages from Europe (German, Spanish, French, Italian, Portuguese) and Asia (Chinese, Hindi, Thai) while holding the source texts and downstream models fixed. Across two downstream tasks (radiology finding extraction from clinical reports and text retrieval), translation introduces performance drops that increase with language difficulty (US State Dept. categories; Spearman $\lvert\rho\rvert \geq 0.83$), suggesting systematic rather than random drift and providing an external, pre-translation diagnostic. Over repeated round-trip translations, performance drops early and then stabilizes in subsequent round-trips. Semantic similarity metrics (COMET) can track this drift, providing a lightweight post-translation diagnostic for downstream drift. Our findings suggest that preprocessing non-English texts using MT may introduce systematic biases that could degrade downstream trained models and tasks, highlighting an important pitfall in the equitable multilingual use of LLMs at scale.
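The abstract's central analysis, correlating a language's US State Dept. difficulty category with its round-trip translation performance drop via Spearman's rho, can be sketched as follows. This is a minimal illustrative implementation: the difficulty categories and drop values below are placeholders, not the paper's data, and the correlation statistic is computed from scratch rather than with the authors' tooling.

```python
# Hypothetical sketch: correlating language difficulty with downstream
# performance drop after round-trip translation, as in the abstract's
# Spearman analysis. All numeric values are illustrative placeholders.

def rank(values):
    """Assign average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# US State Dept. difficulty category per pivot language (placeholder values)
difficulty = {"de": 2, "es": 1, "fr": 1, "it": 1, "pt": 1,
              "zh": 4, "hi": 3, "th": 3}
# Illustrative downstream performance drops after round-trip translation
drop = {"de": 0.04, "es": 0.02, "fr": 0.03, "it": 0.02, "pt": 0.02,
        "zh": 0.12, "hi": 0.09, "th": 0.10}

langs = sorted(difficulty)
rho = spearman([difficulty[l] for l in langs], [drop[l] for l in langs])
print(f"Spearman rho = {rho:.2f}")
```

A strong positive rho under this setup would mirror the paper's finding that drops grow with language difficulty, i.e. the drift is systematic rather than random.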
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 141