Patch Generation with Language Models: Feasibility and Scaling Behavior

04 Mar 2022, 03:33 (modified: 22 Apr 2022, 13:00), DL4C 2022
Abstract: Large language models have shown a propensity for generating correct, multi-line programs from natural language prompts. Given past findings that bugs and patches can be distinguished by their predictability under simple language models, it is natural to ask whether modern, large neural models lend themselves especially well to program repair without any calibration. We study this question in the context of one-line bugs: we provide a series of models of varying scale (from 160M to 12B parameters) with the context preceding a buggy line in 72 Java and Python programs, and analyze the rank at which the correct patch (and the original buggy line) is generated, if at all. Our results show a noticeable correlation of model size with test-passing accuracy and patch ranking quality. We also report several findings on the differences between the two languages and on the propensity of models, especially the largest ones, to generate candidate patches that closely resemble, if not exactly match, the original developer patch.
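
The ranking analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the whitespace normalization, and the example candidate list are all assumptions made for illustration. Given candidate lines that a model produced for one bug, ordered from most to least likely, it reports the rank of the developer patch and of the original buggy line.

```python
from typing import Optional, Sequence


def normalize(line: str) -> str:
    """Collapse whitespace so formatting differences do not affect matching.

    Assumption: exact match after whitespace normalization; the paper may
    use a different matching criterion.
    """
    return " ".join(line.split())


def rank_of(target: str, candidates: Sequence[str]) -> Optional[int]:
    """Return the 1-based rank of `target` in `candidates`, or None if absent.

    `candidates` is assumed to be ordered from most to least likely.
    """
    wanted = normalize(target)
    for rank, candidate in enumerate(candidates, start=1):
        if normalize(candidate) == wanted:
            return rank
    return None


# Hypothetical model output for a single one-line bug, ordered best-first.
candidates = [
    "if (i <= n) {",
    "if (i < n) {",
    "if (i != n) {",
]
buggy_line = "if (i <= n) {"
developer_patch = "if (i <  n) {"  # extra space is removed by normalize()

patch_rank = rank_of(developer_patch, candidates)
bug_rank = rank_of(buggy_line, candidates)
print(f"developer patch rank: {patch_rank}, buggy line rank: {bug_rank}")
```

In this made-up case the buggy line ranks above the patch, which is the kind of outcome the rank comparison is meant to surface. A candidate that never appears yields `None`, corresponding to the "if at all" case.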