Abstract: Despite the prominence of decoder-only language models, encoders remain crucial for resource-constrained applications.
We introduce ModernGBERT (134M, 1B), a fully transparent family of German encoder models trained from scratch, incorporating architectural innovations from ModernBERT.
To evaluate the practical trade-offs of training encoders from scratch, we also present LLäMmlein2Vec (120M, 1B, 7B), a family of encoders derived from German decoder-only models via LLM2Vec.
We benchmark all models on natural language understanding, text embedding, and long-context reasoning tasks, enabling a controlled comparison between dedicated encoders and converted decoders.
Our results show that ModernGBERT 1B surpasses prior state-of-the-art German encoders as well as encoders adapted via LLM2Vec, in terms of both downstream performance and parameter efficiency.
All models, training data, checkpoints and code will be made publicly available, advancing the German NLP ecosystem with transparent, high-performance encoder models.
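For illustration, a minimal sketch of how a released encoder from this family could be used to produce sentence embeddings with Hugging Face transformers; the model identifier is a placeholder (the actual release name is not given here), and mean pooling is just one plausible pooling choice, not necessarily the authors' evaluation setup:

```python
# Hypothetical usage sketch: mean-pooled sentence embeddings from a
# ModernGBERT-style encoder via Hugging Face transformers.
# The model identifier below is a placeholder, not a confirmed release name.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "example-org/ModernGBERT-1B"  # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["Berlin ist die Hauptstadt von Deutschland."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean pooling over non-padding tokens yields one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```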
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Transparency, Pre-training, Scaling, Self-supervised learning, Encoder, Resources for less-resourced languages
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: German
Submission Number: 1150