# Data Collection
This folder contains the code for the first two parts of the benchmark construction procedure described in the paper: (1) repo selection and data scraping, and (2) attribute-based filtering.

We include a comprehensive [tutorial](https://github.com/princeton-nlp/SWE-bench/tree/main/tutorials/collection.md) that describes the end-to-end procedure for collecting evaluation task instances from PyPI repositories.

> SWE-bench's collection pipeline is currently designed to target PyPI packages. We hope to expand SWE-bench to more repositories and languages in the future.

<img src="../assets/collection.png">

## Collection Procedure
To run collection on your own repositories, run the `run_get_tasks_pipeline.sh` script. Given a repository or list of repositories (formatted as `owner/name`), this script generates the following for each repository:
* `<repo>-prs.jsonl` file containing the [metadata for every pull request](https://docs.github.com/rest/reference/pulls#list-pull-requests) from the repository.
* `<repo>-task-instances.jsonl.all` file containing all *valid* task instances (those with associated issues and a gold patch).
    * The instances in this file can be used for fine-tuning.
* `<repo>-task-instances.jsonl` file containing *valid* task instances that also have associated *tests*.
    * The instances in this file are candidate task instances. Once validated, they can be used for evaluation.
    * The `.jsonl.all` file includes these task instances as well.
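
As a minimal sketch of how the generated files can be inspected, the snippet below reads a JSON Lines file into a list of dictionaries. The helper name `load_jsonl` is hypothetical and not part of this directory; the fields inside each record depend on what `build_dataset.py` writes and are not assumed here.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return a list of dicts, one per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

The same helper works for both the `.jsonl` and `.jsonl.all` outputs, since both hold one JSON object per line.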

## Directory Overview
This section briefly describes each file in this directory and how to use it.

**🧐 GitHub Repository Selection**
* `get_top_pypi.py`
    * Purpose: Retrieves the PyPI URL, GitHub URL, # of ⭐, and # of Issues + PRs for the [top 5000](https://hugovk.github.io/top-pypi-packages/) most downloaded PyPI packages.
    * Usage: `python get_top_pypi.py`

**⛏️ GitHub Data Collection**
* `print_pulls.py`
    * Purpose: Given the `<owner/name>` of a GitHub repo, this script writes the raw information for all the repo's PRs to a single `.jsonl` file
    * Usage: `python print_pulls.py <repo name> <path to PRs .jsonl file> --token <GitHub Token>`
* `build_dataset.py`
    * Purpose: Given the path to a PRs `.jsonl` file generated by `print_pulls.py`, this script attempts to convert each PR to a task instance. It creates a `.jsonl.all` file for any PRs with an issue and a `.jsonl` file for any PRs with both an issue and modifications to that repository's tests.
    * Usage: `python build_dataset.py <path to PRs .jsonl file> <path to output .jsonl file> --token <GitHub Token>`
* `get_tasks_pipeline.py`
    * Purpose: Automates invocation of the repo → task instance construction pipeline (`print_pulls.py` + `build_dataset.py`) for multiple repositories
    * Usage: `./run_get_tasks_pipeline.sh` (Check file for arguments)
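
The two-step flow above can be sketched roughly as follows. This is an illustration built only from the usage lines in this section, not the contents of `get_tasks_pipeline.py`. The function names are hypothetical, and deriving the `<repo>` part of the file names from the `name` half of `owner/name` is an assumption.

```python
import subprocess
import sys


def pipeline_commands(repo, out_dir, token):
    """Build the two commands for one repository given as `owner/name`."""
    name = repo.split("/")[-1]  # assumption: output files are named after the repo name
    prs = f"{out_dir}/{name}-prs.jsonl"
    tasks = f"{out_dir}/{name}-task-instances.jsonl"
    return [
        # Step 1: dump raw PR metadata to a .jsonl file.
        [sys.executable, "print_pulls.py", repo, prs, "--token", token],
        # Step 2: convert the PRs into task instances.
        [sys.executable, "build_dataset.py", prs, tasks, "--token", token],
    ]


def run_pipeline(repos, out_dir, token):
    """Run both steps for each repository, stopping on the first failure."""
    for repo in repos:
        for cmd in pipeline_commands(repo, out_dir, token):
            subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to print and review the commands before spending GitHub API quota on them.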

**🎵 Fine Tuning Dataset Construction**
* `build_dataset_ft.py`
    * Purpose: Given the path to a collection of `.jsonl.all` files generated by `build_dataset.py`, this is a simple script that combines all such files into a single `.jsonl` file, which can be used to construct an instruction-tuning dataset based on [problem statement + original code, code Δ] pairs.
    * Usage: `./run_build_dataset_ft` (Check file for arguments)
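
The combining step might look like the sketch below, which only concatenates records. The function name `combine_jsonl_all` is hypothetical, and `build_dataset_ft.py` may filter or reshape records in ways not shown here.

```python
import glob
import os


def combine_jsonl_all(input_dir, output_path):
    """Concatenate every `*.jsonl.all` file in `input_dir` into one `.jsonl` file.

    Returns the number of records written.
    """
    count = 0
    # Sort so the output order is deterministic across runs.
    paths = sorted(glob.glob(os.path.join(input_dir, "*.jsonl.all")))
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        count += 1
    return count
```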

**🪞 Mirroring Repositories**
* `make_repo.sh`
    * Purpose: A script for creating a [mirror repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/duplicating-a-repository) of an existing repository on GitHub. Examples available under the [swe-bench organization](https://github.com/orgs/swe-bench/repositories).
    * Usage: `python call_make_repo.py` (Check file for arguments)

**🧹 Clean Up**
* `delete_gh_workflows.py`
    * Purpose: Recurring workflows from mirror repositories can clog the inbox of the email account associated with your GitHub token. Given a repo URL, this script automates removing the `.github/workflows` folder from all branches of a repository.
    * Usage: `python delete_gh_workflows.py <repo URL>`
* `remove_envs.py`
    * Purpose: SWE-bench's evaluation + validation harnesses rely on the creation of multiple virtual environments with conda to speed up benchmark evaluation. Use this script to parallelize conda environment removal for environments named with the same prefix.
    * Usage: `python remove_envs.py <prefix> --conda_path <path to conda installation>`
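
Prefix-based parallel removal could be sketched as below. This is not the contents of `remove_envs.py`. The function names are hypothetical, and the environment paths are assumed to come from something like `conda env list --json`, which reports environment directories under its `envs` key.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def envs_with_prefix(env_paths, prefix):
    """Return names of environments whose directory name starts with `prefix`."""
    names = [os.path.basename(os.path.normpath(p)) for p in env_paths]
    return [n for n in names if n.startswith(prefix)]


def removal_command(conda_exe, name):
    """Build the command that removes one named environment without prompting."""
    return [conda_exe, "env", "remove", "-n", name, "-y"]


def remove_envs(conda_exe, env_paths, prefix, workers=4):
    """Remove all matching environments using a small thread pool."""
    names = envs_with_prefix(env_paths, prefix)

    def remove(name):
        subprocess.run(removal_command(conda_exe, name), check=True)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(remove, names))  # consume the iterator to surface errors
    return names
```

Choose a prefix specific enough that it cannot match the base installation directory or unrelated environments, since matching is done on the directory name alone.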