Keywords: NLU, reasoning limitations, scaling laws, instruction following, benchmarking
TL;DR: COLE, a 23-task French benchmark, shows that open models lag behind closed ones and that specialized models lack deep reasoning. Catastrophic failures occur in regional dialect understanding and zero-shot extractive QA.
Abstract: Despite LLMs' dominance in English, their transferability to French NLU remains inconsistent.
To characterize this limitation, we present COLE, a new benchmark comprising 23 diverse French tasks, and use it to reveal significant failure modes in state-of-the-art (SOTA) models. Our analysis reveals three critical negative results: 1) a persistent performance gap where top open-weight models lag behind closed models by over 20\%, 2) the illusion of specialization, where surface-level fluency in tuned models masks deep reasoning deficits, and 3) catastrophic failure in zero-shot extractive QA and regional dialect understanding, where many models, including top-tier reasoning models, achieve 0\% Exact Match or perform near random baselines.
We analyze these unexpected failures to highlight specific frontiers (morphology, cultural nuance) where scaling laws currently fail to generalize beyond English.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 26