Keywords: large language models, quantum programming, benchmarking, Qiskit, PennyLane, Cirq
TL;DR: We present QuanBench Plus, a unified benchmark that evaluates how reliably modern LLMs generate functionally correct quantum programs across Qiskit, PennyLane, and Cirq, using Pass@k and distribution-based grading with a feedback repair loop.
Abstract: Large language models (LLMs) are increasingly used for code generation and task automation, yet their effectiveness at quantum code generation across the major frameworks remains underexplored. This work introduces \textit{QuanBench Plus}, a unified multi-framework benchmark spanning Qiskit, PennyLane, and Cirq. Specifically, 42 tasks are adapted across three foundational categories (quantum algorithms, gate decomposition, and state preparation), and framework-aligned canonical solutions are provided for automated grading. Following the functional-evaluation paradigm popularized by code-generation benchmarks such as HumanEval, correctness is assessed with Pass@k-based functional evaluation, supplemented by KL-divergence-based acceptance for probabilistic outputs. Pass@1 results are reported using greedy decoding, and Pass@5 results using $k=5$ samples per task. Pass@1 after feedback (FB) is additionally reported, where feedback to the model is triggered by an incorrect answer or a compilation error. Fidelity is excluded from primary scoring because circuit similarity may not reflect prompt-specific functional correctness. The best-performing models achieve Pass@1 scores of up to 42.9\% in PennyLane, 54.8\% in Cirq, and 59.5\% in Qiskit, illustrating both progress and the remaining gaps in using LLMs for reliable quantum code generation.
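The two grading primitives named in the abstract can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function names and the KL acceptance threshold are assumptions, and the Pass@k formula is the standard unbiased estimator from the HumanEval line of work.

```python
# Illustrative sketch of Pass@k scoring and KL-divergence acceptance for
# probabilistic quantum outputs. Names and the 0.1 threshold are assumptions.
import math
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: n = samples drawn per task,
    c = samples that passed, k = evaluation budget (k <= n)."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def kl_divergence(p: dict, q: dict, eps: float = 1e-12) -> float:
    """KL(p || q) over measurement outcomes (bitstring -> probability).
    eps guards against zero-probability outcomes in the candidate q."""
    return sum(pp * math.log(pp / max(q.get(b, 0.0), eps))
               for b, pp in p.items() if pp > 0.0)


def accept_probabilistic(reference: dict, candidate: dict,
                         threshold: float = 0.1) -> bool:
    """Accept a candidate output distribution if it is close (in KL) to the
    canonical solution's distribution; threshold is hypothetical."""
    return kl_divergence(reference, candidate) <= threshold
```

For example, a task where 1 of 5 samples passes gives `pass_at_k(5, 1, 1) == 0.2`, and a Bell-state candidate matching the reference `{"00": 0.5, "11": 0.5}` has zero KL divergence and is accepted.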
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 109