Keywords: Quantum Computing, Large Language Models, Model Reliability
TL;DR: We introduce QC-Bench, a human-authored benchmark designed to evaluate large language models on core topics in quantum computing.
Abstract: Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts. While existing benchmarks evaluate quantum code generation and circuit design, models' understanding of quantum computing concepts has not been systematically measured. QC-Bench addresses this gap with over 6,000 expert-level questions on quantum algorithms, error correction, and security protocols. Evaluating 31 models from OpenAI, Anthropic, Google, and Meta reveals strong performance on established theory but systematic failures on advanced topics such as quantum security and recent attack vectors. Human participants scored between 23\% and 86\%, with experts averaging 74\% and all participants averaging 57\%. Top-performing models exceeded the expert average, with Claude Sonnet 4 and GPT-5 reaching 88\% overall yet dropping to 76\% on security questions. Additional evaluation across question formats and languages reveals further variation in model performance, demonstrating that QC-Bench provides a necessary framework for measuring language model reliability in quantum computing contexts.
Primary Area: datasets and benchmarks
Submission Number: 20330