A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT

Louis Estève; Christophe Servan; Thomas Lavergne; Agata Savary

Published: 27 May 2026, Last Modified: 27 May 2026

Venue: UniDive 2026

License: CC BY-SA 4.0

Keywords: diversity quantification, sampling, encoder quality, modernbert

Working Group: WG4: Quantifying and promoting diversity

Abstract: Diversity has attracted growing interest in NLP in recent years. At the same time, state-of-the-art transformer models such as ModernBERT rely on very large pre-training datasets that are built for size rather than for diversity. This calls for an investigation of how diversity affects ModernBERT pre-training. We find that, on some tasks, diversity-driven sampling yields a gain of 10 points over randomly sampled pre-training data of comparable size. We also find that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens can reach performance comparable to that of a model pre-trained for 1,775h on a randomly sampled dataset of 2.4B tokens.
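To put the last comparison in proportion, the following minimal sketch computes the ratios implied by the figures quoted in the abstract. The variable names are illustrative only, and the abstract does not say whether the hours are wall-clock or accelerator hours, so the time ratio should be read as a rough indication rather than a precise cost comparison.

```python
# Figures as quoted in the abstract; variable names are illustrative assumptions.
diverse_hours = 483               # pre-training time on the diversity-driven dataset
random_hours = 1775               # pre-training time on the randomly sampled dataset
diverse_tokens = 150_000_000      # 150M tokens
random_tokens = 2_400_000_000     # 2.4B tokens

# How many times more training time and data the randomly sampled setup used.
time_ratio = random_hours / diverse_hours
token_ratio = random_tokens / diverse_tokens

print(f"Training time ratio: {time_ratio:.2f}x")
print(f"Dataset size ratio:  {token_ratio:.0f}x")
```

Under these figures, the randomly sampled setup uses roughly 3.7 times the training time and 16 times the tokens to reach comparable performance.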

Tracks For Type Of Contribution: Complete work (including previously published work)

Submission Number: 70
