Keywords: LLM evaluation, multilingual, Czech, benchmarking
TL;DR: MiniCzechBenchmark is a fast Czech LLM benchmark that matches full-suite rankings while exposing language–reasoning gaps.
Abstract: The evaluation of large language models faces dual challenges: comprehensive benchmarks require prohibitive computational resources, while contamination threatens validity. For non-English languages these challenges compound, creating barriers to the rapid iteration that drives AI breakthroughs. We present MiniCzechBenchmark, a lightweight framework demonstrating that carefully designed subset evaluation can maintain statistical validity (>0.996 correlation with full benchmarks) while reducing computational requirements by over 90%. Drawing on this actively maintained Czech benchmark, which covers 50+ open and commercial models, we reveal critical patterns: reasoning-focused models such as DeepSeek-R1 excel at mathematics (78%) but degrade on grammatical tasks, and 10-30% performance gaps persist between English and Czech capabilities. Notably, we find that multiple-choice accuracy and generation quality are distinct competencies: a model may answer Czech questions correctly while producing poor Czech text.
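The abstract's core validity claim is that a well-chosen subset of test items preserves model rankings from the full benchmark. A minimal sketch of how such a check might look is below; the model names and scores are hypothetical placeholders, and the SciPy correlation functions stand in for whatever statistics the authors actually used.

```python
# Sketch: checking that a mini-benchmark tracks the full suite.
# All scores below are hypothetical placeholders, not MiniCzechBenchmark data.
from scipy.stats import pearsonr, spearmanr

# Accuracy of each model on the full benchmark vs. the subset.
full_scores = {"model-a": 0.81, "model-b": 0.74, "model-c": 0.62, "model-d": 0.55}
mini_scores = {"model-a": 0.79, "model-b": 0.75, "model-c": 0.60, "model-d": 0.57}

models = sorted(full_scores)
full = [full_scores[m] for m in models]
mini = [mini_scores[m] for m in models]

# Pearson r measures score agreement; Spearman rho measures ranking agreement.
r, _ = pearsonr(full, mini)
rho, _ = spearmanr(full, mini)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

A high correlation on both measures (the paper reports >0.996) is what justifies running the cheap subset in place of the full suite.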
Submission Number: 223