Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

ACL ARR 2026 January Submission 10547 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Debugging, Coding, LLM, Precision, Evaluation
Abstract: Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, an automatic pipeline that converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We further define two novel metrics, edit-level precision and bug-level recall, which provide fine-grained measurements of how many of the applied edits are necessary and how many of the seeded bugs are resolved. We release PDB-Hard, a 5,734-example evaluation set for measuring debugging precision and recall. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above $78\%$ but exhibit precision below $45\%$, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
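The abstract's two metrics can be sketched in a few lines of Python. This is a hypothetical illustration only: the function names, the line-level edit granularity, and the treatment of empty edit sets are assumptions, since the paper's exact definitions are not given here.

```python
def edit_level_precision(applied_edits, necessary_edits):
    """Fraction of the model's applied edits that were actually necessary.

    Both arguments are collections of edit identifiers (here, line labels);
    this granularity is an assumption for illustration.
    """
    applied = set(applied_edits)
    if not applied:
        return 0.0  # assumed convention: no edits -> zero precision
    return len(applied & set(necessary_edits)) / len(applied)


def bug_level_recall(resolved_bugs, all_bugs):
    """Fraction of the seeded bugs that the model's patch resolves."""
    seeded = set(all_bugs)
    if not seeded:
        return 0.0
    return len(set(resolved_bugs) & seeded) / len(seeded)


# Example: the model edits 5 lines but only 2 were needed (regeneration
# behavior), and it fixes 2 of the 3 seeded bugs.
p = edit_level_precision({"L3", "L7", "L9", "L12", "L20"}, {"L3", "L7"})
r = bug_level_recall({"off_by_one", "wrong_operator"},
                     {"off_by_one", "wrong_operator", "missing_guard"})
# p == 0.4, r == 2/3
```

Under these assumed definitions, a model that regenerates the whole program can still reach high unit-test pass rates and high recall while its edit-level precision stays low, which is the failure mode the benchmark is designed to expose.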
Paper Type: Long
Research Area: Code Models
Research Area Keywords: Program repair, Evaluation of code models, Code reasoning, Bug detection
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 10547