Translation-Induced Label Drift across Nine Languages in Natural Language Inference

Published: 28 Apr 2026, Last Modified: 26 May 2026 | Venue: MSLD 2026 Poster | Readers: Everyone | License: CC BY 4.0
Keywords: Natural Language Inference, Machine Translation, Cross-Lingual NLP, Intensionality
TL;DR: Translating NLI pairs and inheriting their labels fails because logical relationships are not translation-invariant; instead, labels systematically drift toward Neutral through semantic weakening across typologically diverse languages.
Abstract: Natural Language Inference (NLI) is a foundational task in natural language processing that aims to determine the logical relationship between a premise and a hypothesis (MacCartney, 2009). Multilingual NLI datasets are typically constructed by translating English sentence pairs into the target language(s) and inheriting the original labels without systematic validation. This approach implicitly assumes that NLI relations are cross-linguistically invariant. This study provides large-scale empirical evidence that this assumption does not hold. We sampled 1,000 English premise–hypothesis pairs from established NLI benchmarks, namely MultiNLI (Williams et al., 2018), ANLI (Nie et al., 2020), WANLI (Liu et al., 2022), NLI Fever (Nie et al., 2019), and LING NLI (Parrish et al., 2021), while ensuring balanced coverage of four linguistically motivated phenomena: conditionality, intensionality, modality, and comparative or quantifier-scope constructions. Premises and hypotheses were translated into nine typologically diverse languages (Arabic, Cantonese, German, Spanish, Portuguese, Thai, Turkish, Urdu, and Mandarin) using the gpt-4o-mini API. Native linguist annotators then manually annotated the translated pairs using standard NLI guidelines, without access to the original English labels. Results show that cross-lingual label agreement is uniformly low. Agreement with English gold labels ranges from approximately 22% (Arabic, German) to 31% (Cantonese, Mandarin, Urdu), with macro recall between 0.21 and 0.30 across all languages. Per-label analysis also revealed asymmetries.
For instance, Entailment achieved intermediate recall (0.13–0.41 across languages), Contradiction was the most unstable class (recall 0.15–0.30), and Neutral was systematically over-assigned. We also observed a dominant drift pattern in which both Entailment and Contradiction pairs collapse into Neutral, a behavior that was consistent across all nine languages. This is further reinforced by the label shift rates: the Contradiction-to-Neutral transition rate reaches 0.58 in Arabic and 0.52–0.56 in Spanish and German, which indicates a pervasive tendency toward semantic weakening under translation. Additionally, phenomenon-based analysis reveals that label drift is not random but is consistently linked to specific constructions. In one English pair, the modal “should” in the hypothesis “The product should hit stores by the summer months” was rendered in Arabic as يجب أن (“must/obligated to”), shifting the hypothesis from a prediction to a normative demand, which changed a clear Entailment into Neutral. This exemplifies a broader pattern in which translation resolves scope, modality, and intensionality in ways that attenuate the inferential force of the original pair. In this talk, we will present more specific cross-linguistic regularities, including drift patterns tied to the linguistic phenomena under investigation.
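The quantities reported in the abstract (agreement with English gold labels, per-label and macro recall, and label transition rates such as Contradiction-to-Neutral) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names and the six-item toy label lists are hypothetical. The definitions are assumptions: recall is taken per English gold label, macro recall is the unweighted mean over labels, and a transition rate is the share of pairs with a given gold label that received a given annotated label.

```python
from collections import Counter

# Assumed label inventory; the abstract names Entailment, Neutral, Contradiction.
LABELS = ("entailment", "neutral", "contradiction")


def agreement(gold, annotated):
    """Fraction of pairs whose annotated label equals the English gold label."""
    matches = sum(g == a for g, a in zip(gold, annotated))
    return matches / len(gold)


def per_label_recall(gold, annotated):
    """Recall per gold label: correctly recovered pairs / pairs with that gold label."""
    totals = Counter(gold)
    hits = Counter(g for g, a in zip(gold, annotated) if g == a)
    return {lab: hits[lab] / totals[lab] for lab in LABELS if totals[lab]}


def macro_recall(gold, annotated):
    """Unweighted mean of the per-label recalls (assumed definition)."""
    recalls = per_label_recall(gold, annotated)
    return sum(recalls.values()) / len(recalls)


def transition_rates(gold, annotated):
    """Map (gold, annotated) -> share of that gold label assigned that annotated label."""
    totals = Counter(gold)
    pairs = Counter(zip(gold, annotated))
    return {(g, a): n / totals[g] for (g, a), n in pairs.items()}


# Hypothetical toy data, invented for illustration only (not from the study).
gold = ["entailment", "entailment", "contradiction",
        "contradiction", "neutral", "neutral"]
annotated = ["entailment", "neutral", "neutral",
             "neutral", "neutral", "contradiction"]

rates = transition_rates(gold, annotated)
print("agreement:", agreement(gold, annotated))
print("macro recall:", macro_recall(gold, annotated))
print("contradiction -> neutral:", rates[("contradiction", "neutral")])
```

In this toy example both Contradiction pairs are annotated as Neutral, so the Contradiction-to-Neutral rate is 1.0 while Contradiction recall is 0. That is the kind of collapse toward Neutral the abstract describes, shown here at an exaggerated scale.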
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 44