Keywords: Debugging, Coding, LLM, Precision, Evaluation
Abstract: Unlike code completion, debugging requires localizing faults and applying targeted edits.
We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging.
To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmarking (PDB) framework, an automatic pipeline that converts any coding dataset into a debugging benchmark with precision-aware evaluation.
PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs.
We further define two novel metrics, edit-level precision and bug-level recall, which measure, respectively, the fraction of a model's edits that are necessary and the fraction of injected bugs that are resolved.
We release PDB-Hard, a 5,734-example evaluation set for measuring debugging precision and recall.
Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above $78\%$ but exhibit precision below $45\%$, even when explicitly instructed to perform minimal debugging.
Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
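The two metrics above can be illustrated with a minimal sketch (not the paper's implementation), assuming model edits, necessary edits, and bugs are represented as sets of identifiers:

```python
# Hypothetical sketch of the abstract's two metrics, assuming set-based
# representations of edits and bugs (the paper's actual matching procedure
# may differ).

def edit_level_precision(model_edits: set, necessary_edits: set) -> float:
    """Fraction of the model's edits that were actually necessary."""
    if not model_edits:
        return 0.0
    return len(model_edits & necessary_edits) / len(model_edits)

def bug_level_recall(resolved_bugs: set, injected_bugs: set) -> float:
    """Fraction of injected bugs that the model resolved."""
    if not injected_bugs:
        return 0.0
    return len(resolved_bugs & injected_bugs) / len(injected_bugs)
```

Under these definitions, a model that fixes every bug but also makes many unnecessary edits scores high recall but low precision, which is the failure mode the abstract reports.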
Paper Type: Long
Research Area: Code Models
Research Area Keywords: Program repair, Evaluation of code models, Code reasoning, Bug detection
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 10547