Enhancing Pull Request Reviews: Leveraging Large Language Models to Detect Inconsistencies Between Issues and Pull Requests

Published: 01 Jan 2025 · Last Modified: 20 Jul 2025 · Forge@ICSE 2025 · CC BY-SA 4.0
Abstract:

Context: An efficient Pull Request (PR) review process is critical in software development. Part of this process is checking the alignment between a PR and its corresponding issue. Traditional manual PR review often struggles to identify inconsistencies between the improvements or fixes outlined in an issue and the changes actually proposed in the PR, so such inconsistencies can slip through the PR acceptance process unnoticed.

Objective: We aim to enhance the PR review process by leveraging modern Large Language Models (LLMs) to detect inconsistencies between issue descriptions and the code changes in submitted PRs.

Method: We manually labeled a statistically significant sample of PRs from the Transformers repository to assess their alignment with the corresponding issue descriptions. Each PR was categorized into one of four groups: exact, missing, tangling, or missing and tangling. This labeled dataset served as the benchmark for evaluating four widely used models: Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct, GPT-4o, and GPT-4o mini. The models were tested with three distinct prompts designed to capture different aspects of issues and PRs. Each model was tasked with identifying tangled and missing elements, and its outputs were compared against the manually labeled data to assess accuracy and reliability.

Results: Manual labeling of the stratified sample from the Transformers repository revealed the following distribution of PR-issue alignments: 68.04% exact, 16.50% missing, 13.40% tangling, and 2.06% both missing and tangling. A strong correlation was observed between PR merge status and exact alignment: 75.46% of merged PRs were classified as exact, compared with only 29.03% of unmerged PRs. These findings highlight opportunities for improving the current code review process. For automated classification, the most effective prompt configuration combined the issue text, the PR text, and the PR diff, enabling better detection of alignment inconsistencies. Among the models tested, GPT-4o and Llama-3.1-405B-Instruct delivered the highest performance, achieving the best weighted F1 scores of 0.5948 and 0.6190, respectively.

Conclusion: Despite the notable correlation between PR merge status and exact alignment, our analysis shows that merged PRs can still contain inconsistencies, such as missing or tangling changes. While the tested LLMs show potential for automating PR-issue alignment checks, their current performance is limited, underscoring the need for further refinement to improve accuracy and reliability. Improved LLM-based tools could streamline the PR review process, reducing manual effort and enhancing code quality.
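To make the evaluation setup concrete, the sketch below shows one plausible realization of the best-performing configuration reported above: a single prompt combining the issue text, PR text, and PR diff, sent to GPT-4o, with predictions scored against the manual labels via a weighted F1. This is a minimal sketch, assuming the OpenAI Python SDK and scikit-learn; the prompt wording, label strings, and helper names (classify_alignment, evaluate, pairs) are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): classify a PR-issue pair with GPT-4o
# using issue text + PR text + PR diff, then score with a weighted F1.
from openai import OpenAI
from sklearn.metrics import f1_score

LABELS = ["exact", "missing", "tangling", "missing_and_tangling"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_alignment(issue_text: str, pr_text: str, pr_diff: str) -> str:
    """Ask the model how well a PR's changes align with its linked issue."""
    prompt = (
        "You review pull requests. Given an issue, a PR description, and the "
        "PR diff, classify their alignment as exactly one of: "
        + ", ".join(LABELS) + ".\n\n"
        f"Issue:\n{issue_text}\n\n"
        f"PR description:\n{pr_text}\n\n"
        f"PR diff:\n{pr_diff}\n\n"
        "Answer with the label only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for a fixed-label task
    )
    # Replies outside the label set simply count as misclassifications below.
    return response.choices[0].message.content.strip().lower()

def evaluate(pairs):
    """pairs: iterable of (issue_text, pr_text, pr_diff, gold_label) tuples."""
    gold = [label for *_, label in pairs]
    pred = [classify_alignment(i, p, d) for i, p, d, _ in pairs]
    return f1_score(gold, pred, labels=LABELS, average="weighted")
```

Scoring with average="weighted" matches the abstract's metric: per-label F1 scores are averaged with weights proportional to each label's support, which matters here given the skewed class distribution (68.04% exact).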