Keywords: LLM evaluation, multilingual, Czech, benchmarking
TL;DR: MiniCzechBenchmark is a fast Czech LLM benchmark that matches full-suite rankings while exposing language–reasoning gaps.
Abstract: The evaluation of large language models faces dual challenges: comprehensive benchmarks require prohibitive computational resources, while contamination threatens validity. For non-English languages these challenges compound, creating barriers to the rapid iteration that drives AI breakthroughs. We present MiniCzechBenchmark, a lightweight framework demonstrating that carefully designed subset evaluation can maintain statistical validity (>0.996 correlation with full benchmarks) while reducing computational requirements by over 90%. Drawing on this actively maintained Czech benchmark, which covers 50+ open and commercial models, we reveal critical patterns: reasoning-focused models such as DeepSeek-R1 excel at mathematics (78%) but degrade on grammatical tasks, and 10-30% performance gaps persist between English and Czech capabilities. Notably, we find that multiple-choice accuracy and generation quality are distinct competencies: a model may answer Czech questions correctly while producing poor Czech text.
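The abstract's core validity claim is that a well-chosen subset of test items preserves model rankings from the full benchmark. A minimal sketch of how such a check might look is below; the model names and scores are hypothetical placeholders, and the SciPy correlation functions stand in for whatever statistics the authors actually used.

```python
# Sketch: checking that a mini-benchmark tracks the full suite.
# All scores below are hypothetical placeholders, not MiniCzechBenchmark data.
from scipy.stats import pearsonr, spearmanr

# Accuracy of each model on the full benchmark vs. the subset.
full_scores = {"model-a": 0.81, "model-b": 0.74, "model-c": 0.62, "model-d": 0.55}
mini_scores = {"model-a": 0.79, "model-b": 0.75, "model-c": 0.60, "model-d": 0.57}

models = sorted(full_scores)
full = [full_scores[m] for m in models]
mini = [mini_scores[m] for m in models]

# Pearson r measures score agreement; Spearman rho measures ranking agreement.
r, _ = pearsonr(full, mini)
rho, _ = spearmanr(full, mini)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

A high correlation on both measures (the paper reports >0.996) is what justifies running the cheap subset in place of the full suite.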
Submission Number: 223