Keywords: Quantum Computing, Large Language Models, Model Reliability
TL;DR: We introduce QC-Bench, a human-authored benchmark designed to evaluate large language models on core topics in quantum computing.
Abstract: Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts. While existing benchmarks evaluate quantum code generation and circuit design, models' understanding of quantum computing concepts has not been systematically measured. QC-Bench addresses this gap with over 6,000 expert-level questions on quantum algorithms, error correction, and security protocols. Evaluating 31 models from OpenAI, Anthropic, Google, and Meta reveals strong performance on established theory but systematic failures on advanced topics such as quantum security and recent attack vectors. Human participants scored between 23\% and 86\%, with experts averaging 74\% and all participants averaging 57\%. Top-performing models exceeded the expert average, with Claude Sonnet 4 and GPT-5 reaching 88\% overall yet dropping to 76\% on security questions. Additional evaluation across question formats and languages reveals further variation in model performance, demonstrating that QC-Bench provides a necessary framework for measuring language model reliability in quantum computing contexts.
Primary Area: datasets and benchmarks
Submission Number: 20330