Keywords: LLM benchmark, multilingual evaluation, French, idiomatic expressions
Abstract: Mainstream multilingual LLMs are generally trained on a much higher proportion of English than multilingual data, raising questions about their ability to capture linguistic features particular to non-English languages or information important to non-anglophone cultures. We add to a growing effort to increase multilingual sensitivity in LLMs by developing EIFFEL, a benchmark testing mastery of French idiomatic expressions in context. We fully explain the methodology, which exploits input from native French speakers, to make it reproducible for other languages. We compare mainstream multilingual LLMs with French-focused LLMs on both standard LLM benchmarks and EIFFEL; EIFFEL brings out the benefits of higher proportions of French data and exposes the limitations of standard benchmarks for measuring multilingual competence. We also train from scratch a series of 1B SLMs with different proportions of French and English pretraining data, which confirm EIFFEL's lessons.
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: multilingual evaluation, multilingual benchmarks, corpus creation, benchmarking, language resources, multilingual corpora, NLP datasets, multilingual pre-training, cross-lingual transfer
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: French, English
Submission Number: 5848