NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark

ACL ARR 2025 February Submission5770 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0

Abstract: This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets, five of which are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-created prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pretrained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials will be made publicly available upon acceptance.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking, language resources, automatic creation and evaluation of language resources

Contribution Types: Data resources

Languages Studied: Norwegian

Submission Number: 5770
