Abstract: In this work, we present a novel dataset specifically designed for predicting pull request (PR) outcomes with large language models (LLMs). In contrast to earlier techniques that rely on purely numerical datasets, ours is the first to integrate textual and code-related features, enabling the use of LLMs for PR outcome prediction. To construct this dataset, we collected and carefully filtered PR data from six well-known repositories on GitHub, the largest platform for collaborative code development. The dataset consists of 300 PRs, each labeled with a 'green' or 'red' flag indicating whether the PR was merged or rejected. Each PR is annotated with key features such as its title, body, comments, contributor statistics, code changes, and related issues. The ratio of merged to unmerged PRs in the dataset is approximately 2:1. To promote reproducibility and foster further research, we will publicly release the dataset. This work lays the groundwork for building intelligent systems that assist in PR review and decision-making by leveraging the capabilities of LLMs.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets, benchmarking, corpus creation, evaluation methodologies, LLM efficiency, prompting, reproducibility, software and tools
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1304