Keywords: NLU, reasoning limitations, scaling laws, instruction following, benchmarking
TL;DR: COLE, a 23-task French benchmark, shows that open models lag behind closed ones and that specialized models lack deep reasoning. Catastrophic failures occur in regional dialect understanding and zero-shot extractive QA.
Abstract: Despite LLMs' dominance in English, their transferability to French NLU remains inconsistent.
To characterize this limitation, we present COLE, a new benchmark comprising 23 diverse French tasks, and use it to reveal significant failure modes in state-of-the-art (SOTA) models. Our analysis reveals three critical negative results: 1) a persistent performance gap where top open-weight models lag behind closed models by over 20\%, 2) the illusion of specialization, where surface-level fluency in tuned models masks deep reasoning deficits, and 3) catastrophic failure in zero-shot extractive QA and regional dialect understanding, where many models, including top-tier reasoning models, achieve 0\% Exact Match or perform near random baselines.
We analyze these unexpected failures to highlight specific frontiers (morphology, cultural nuance) where scaling laws currently fail to generalize beyond English.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 26