Keywords: large language models, quantum programming, benchmarking, Qiskit, PennyLane, Cirq
TL;DR: We present QuanBench Plus, a unified benchmark that evaluates how reliably modern LLMs generate functionally correct quantum programs across Qiskit, PennyLane, and Cirq, using Pass@k and distribution-based grading with a feedback repair loop.
Abstract: Large language models (LLMs) are increasingly used for code generation and task automation, yet their effectiveness at quantum code generation across the major frameworks remains underexplored. This work introduces \textit{QuanBench Plus}, a unified multi-framework benchmark spanning Qiskit, PennyLane, and Cirq. Specifically, 42 tasks are adapted across three foundational categories (quantum algorithms, gate decomposition, and state preparation), and framework-aligned canonical solutions are provided for automated grading. Following the functional-evaluation paradigm popularized by code-generation benchmarks such as HumanEval, correctness is assessed with Pass@k-based functional evaluation, supplemented by KL-divergence-based acceptance for probabilistic outputs. Pass@1 results are reported using greedy decoding, and Pass@5 results using $k=5$ samples per task. Pass@1 after feedback (FB) is additionally reported, where feedback to the model is triggered by an incorrect answer or a compilation error. Fidelity is excluded from primary scoring because circuit similarity may not reflect prompt-specific functional correctness. The best-performing models achieve Pass@1 scores of up to 42.9\% in PennyLane, 54.8\% in Cirq, and 59.5\% in Qiskit, illustrating both progress and the remaining gaps in using LLMs for reliable quantum code generation.
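The two grading primitives named in the abstract can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function names and the KL acceptance threshold are assumptions, and the Pass@k formula is the standard unbiased estimator from the HumanEval line of work.

```python
# Illustrative sketch of Pass@k scoring and KL-divergence acceptance for
# probabilistic quantum outputs. Names and the 0.1 threshold are assumptions.
import math
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: n = samples drawn per task,
    c = samples that passed, k = evaluation budget (k <= n)."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def kl_divergence(p: dict, q: dict, eps: float = 1e-12) -> float:
    """KL(p || q) over measurement outcomes (bitstring -> probability).
    eps guards against zero-probability outcomes in the candidate q."""
    return sum(pp * math.log(pp / max(q.get(b, 0.0), eps))
               for b, pp in p.items() if pp > 0.0)


def accept_probabilistic(reference: dict, candidate: dict,
                         threshold: float = 0.1) -> bool:
    """Accept a candidate output distribution if it is close (in KL) to the
    canonical solution's distribution; threshold is hypothetical."""
    return kl_divergence(reference, candidate) <= threshold
```

For example, a task where 1 of 5 samples passes gives `pass_at_k(5, 1, 1) == 0.2`, and a Bell-state candidate matching the reference `{"00": 0.5, "11": 0.5}` has zero KL divergence and is accepted.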
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 109