Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

Published: 28 Mar 2026, Last Modified: 07 May 2026 · AIware 2026 · CC BY 4.0
Keywords: Agentic AI, Code Generation, Code Quality, Software Security
Abstract: As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code-quality and security issues before and after each change. Our results show that agentic commits improve at least one quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues, predominantly convention-level violations such as long lines, while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security warnings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.
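The before/after measurement sketched in the abstract can be approximated with off-the-shelf JSON output from Pylint and Bandit. The snippet below is a minimal sketch under assumptions, not the paper's actual pipeline: it compares finding counts for a single pair of file snapshots, and the paths `before.py` / `after.py` are hypothetical placeholders.

```python
# Minimal sketch (assumed setup): run Pylint and Bandit on a file snapshot
# from each side of a change and diff the finding counts per check.
import json
import subprocess
from collections import Counter


def pylint_findings(path: str) -> Counter:
    # Pylint emits a JSON list of messages with --output-format=json; its exit
    # code is nonzero whenever it reports anything, so read stdout instead.
    proc = subprocess.run(
        ["pylint", "--output-format=json", path],
        capture_output=True, text=True,
    )
    messages = json.loads(proc.stdout or "[]")
    return Counter(m["symbol"] for m in messages)  # e.g. "line-too-long"


def bandit_findings(path: str) -> Counter:
    # Bandit's JSON report keeps individual findings under the "results" key;
    # its log messages go to stderr, so stdout stays parseable.
    proc = subprocess.run(
        ["bandit", "-f", "json", path],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout or "{}")
    return Counter(r["test_id"] for r in report.get("results", []))  # e.g. "B101"


before = pylint_findings("before.py") + bandit_findings("before.py")
after = pylint_findings("after.py") + bandit_findings("after.py")
print("introduced:", dict(after - before))  # Counter subtraction keeps positive deltas
print("removed:", dict(before - after))
```

Counting by check symbol rather than by (symbol, line) keeps the comparison stable when a refactoring shifts line numbers, which is one plausible design choice for this kind of diffing.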
Revision Summary: To address the reviewers’ comments, we revised the paper as follows:
* **Clarified the filtering procedure for Pylint and Bandit outputs.** We added details explaining which Pylint findings were excluded, including purely stylistic findings, import-related warnings, and fatal messages, and why these exclusions were necessary. We also clarified that the Bandit analysis excludes test files to avoid inflating security counts due to common test-code patterns such as `assert_used` (a filtering sketch follows this list).
* **Clarified how change-operation frequencies should be interpreted.** We added an explanation that the manual operation analysis is based on high-impact sampled commits and that a single commit can contain multiple operations.
* **Added agent-level merge-rate analysis with a caution about imbalance.** We expanded RQ3 by reporting merge rates across agents.
* **Strengthened the threats-to-validity discussion around AIDev labels.** We clarified that AIDev categorizes PRs into task types using GPT-4.1-mini based on PR titles and bodies, rather than relying only on GitHub labels.
* **Added future-work framing for comparison with human-authored PRs.** We clarified in the conclusion that this study characterizes agentic Python refactoring PRs rather than directly comparing them against human-authored refactoring PRs. This addresses reviewer comments about the need for a human baseline while positioning it as future work.
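The filtering described in the first bullet could be implemented as simple predicates over Pylint messages and Bandit targets. The sketch below is illustrative only: the concrete excluded symbols and the test-file path heuristics are assumptions for exposition, since the summary names only the categories (stylistic, import-related, fatal) and the test-file exclusion.

```python
# Hedged illustration of the post-filtering described above; the symbol lists
# and path heuristics are assumed examples, not the paper's configuration.
ASSUMED_STYLISTIC_SYMBOLS = {"trailing-whitespace", "bad-indentation"}
ASSUMED_IMPORT_SYMBOLS = {"import-error", "wrong-import-order", "unused-import"}


def keep_pylint_message(msg: dict) -> bool:
    """Drop fatal messages, import-related warnings, and purely stylistic checks."""
    if msg["type"] == "fatal":  # unparsable files and similar hard failures
        return False
    if msg["symbol"] in ASSUMED_IMPORT_SYMBOLS:
        return False
    if msg["symbol"] in ASSUMED_STYLISTIC_SYMBOLS:
        return False
    return True


def is_bandit_target(path: str) -> bool:
    """Skip test files so test-only patterns such as assert_used (B101)
    do not inflate the security counts."""
    name = path.rsplit("/", 1)[-1]
    return not (
        name.startswith("test_")
        or name.endswith("_test.py")
        or "/tests/" in path
    )
```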
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length papers (i.e. case studies, theoretical, applied research papers). 8 pages
Reroute: false
Submission Number: 63