# Data Collection
This subfolder contains the scripts that collect and filter the GitHub pull requests used in the SWE-bench fine-tuning dataset and evaluation benchmark.
The scripts are grouped by purpose below:

## 🧐 GitHub Repository Selection
`get_top_pypi.py`
* Purpose: Retrieves the PyPI URL, GitHub URL, # of ⭐, and # of Issues + PRs for the [top 5000](https://hugovk.github.io/top-pypi-packages/) most downloaded PyPI packages.
* Usage: `python get_top_pypi.py`

## ⛏️ GitHub Data Collection
`print_pulls.py`
* Purpose: Given the `<owner/name>` of a GitHub repo, this script writes the raw information for all the repo's PRs to a single `.jsonl` file
* Usage: `python print_pulls.py <repo name> <path to PRs .jsonl file> --token <GitHub Token>`

`build_dataset.py`
* Purpose: Given the path to a PRs `.jsonl` file generated by `print_pulls.py`, this script attempts to convert each PR to a task instance. It creates a `.jsonl.all` file for any PRs with an issue and a `.jsonl` file for any PRs with both an issue and modifications to that repository's tests.
* Usage: `python build_dataset.py <path to PRs .jsonl file> <path to output .jsonl file> --token <GitHub Token>`
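The two-file split described above can be sketched as follows. This is only an illustration, not the code in `build_dataset.py`: the function name `split_instances` and the field names `problem_statement` and `test_patch` are assumptions made for the example.

```python
import json
from pathlib import Path


def split_instances(instances, output_path):
    """Write instances to <output_path>.all and <output_path>.

    Assumed (hypothetical) fields on each instance dict:
      - "problem_statement": issue text, empty if the PR has no issue
      - "test_patch": diff touching test files, empty if there is none
    Returns (number written to .all, number written to the main file).
    """
    output_path = Path(output_path)
    all_path = output_path.with_name(output_path.name + ".all")
    n_all = n_with_tests = 0
    with open(all_path, "w") as all_file, open(output_path, "w") as out_file:
        for inst in instances:
            if not inst.get("problem_statement"):
                continue  # no linked issue: not a task instance
            all_file.write(json.dumps(inst) + "\n")
            n_all += 1
            if inst.get("test_patch"):
                # Has an issue AND modifies tests: also goes to the main file
                out_file.write(json.dumps(inst) + "\n")
                n_with_tests += 1
    return n_all, n_with_tests
```

Under these assumptions, every line of the `.jsonl` file also appears in the `.jsonl.all` file, which is a superset.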

`get_tasks_pipeline.py`
* Purpose: Automates invocation of the repo → task instance construction pipeline (`print_pulls.py` + `build_dataset.py`) for multiple repositories
* Usage: `./run_get_tasks_pipeline` (Check file for arguments)
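As a rough sketch of what chaining the two scripts over several repositories could look like, the helper below builds the command lines from the usage strings documented above. The function name `pipeline_commands` and the output file naming scheme are hypothetical and are not taken from `get_tasks_pipeline.py`.

```python
from pathlib import Path


def pipeline_commands(repos, output_dir, token):
    """Build the print_pulls.py + build_dataset.py commands for each repo.

    `repos` holds "<owner/name>" strings. The "-prs.jsonl" and
    "-task-instances.jsonl" suffixes are assumptions for this example.
    """
    output_dir = Path(output_dir)
    commands = []
    for repo in repos:
        name = repo.split("/")[-1]
        prs_path = output_dir / f"{name}-prs.jsonl"
        tasks_path = output_dir / f"{name}-task-instances.jsonl"
        # Step 1: dump the raw PRs for this repository
        commands.append(
            ["python", "print_pulls.py", repo, str(prs_path), "--token", token]
        )
        # Step 2: convert the PRs into task instances
        commands.append(
            ["python", "build_dataset.py", str(prs_path), str(tasks_path),
             "--token", token]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, so that a failure in the first step stops the second from running on a missing file.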

## 🎵 Fine Tuning Dataset Construction
`build_dataset_ft.py`
* Purpose: Given the path to a collection of `.jsonl.all` files generated by `build_dataset.py`, this is a simple script to combine all such files into a single `.jsonl` that can be used to construct an instruction-tuning dataset based on [problem statement + original code, code Δ] pairs.
* Usage: `./run_build_dataset_ft` (Check file for arguments)
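The combining step can be illustrated with a minimal sketch. It assumes all `.jsonl.all` files sit in one directory and uses a hypothetical function name, `combine_jsonl_all`; the actual script may take different arguments or apply extra filtering.

```python
import json
from pathlib import Path


def combine_jsonl_all(input_dir, output_path):
    """Concatenate every *.jsonl.all file in input_dir into one .jsonl file.

    Returns the number of records written. Blank lines are skipped and
    each line is parsed once so malformed JSON fails early.
    """
    count = 0
    with open(output_path, "w") as out_file:
        for path in sorted(Path(input_dir).glob("*.jsonl.all")):
            with open(path) as in_file:
                for line in in_file:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # raises ValueError on malformed input
                    out_file.write(line + "\n")
                    count += 1
    return count
```

Sorting the input paths keeps the output order stable across runs, which makes the combined file easier to diff.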

## 🪞 Mirroring Repositories
`make_repo.sh`
* Purpose: A script for creating a [mirror repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/duplicating-a-repository) of an existing repository on GitHub. Examples available under the [swe-bench organization](https://github.com/orgs/swe-bench/repositories).
* Usage: `python call_make_repo.py` (Check file for arguments)

## 🧹 Clean Up
`delete_gh_workflows.py`
* Purpose: Recurring workflows from mirror repositories can clog up the inbox of the email account associated with your GitHub token. Given a repo URL, this script removes the `.github/workflows` folder from every branch of the repository.
* Usage: `python delete_gh_workflows.py <repo URL>`

`remove_envs.py`
* Purpose: SWE-bench's evaluation + validation harnesses rely on the creation of multiple virtual environments with conda to speed up benchmark evaluation. Use this script to parallelize conda environment removal for environments named with the same prefix.
* Usage: `python remove_envs.py <prefix> --conda_path <path to conda installation>`
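Prefix-based parallel removal can be sketched roughly as below. This is an assumption-laden illustration rather than the contents of `remove_envs.py`: the function names, the `run` parameter, and the use of a thread pool are all choices made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def select_envs(env_names, prefix):
    """Keep only the environment names that start with the given prefix."""
    return [name for name in env_names if name.startswith(prefix)]


def removal_command(conda_bin, env_name):
    """Build the conda command that removes one named environment."""
    return [conda_bin, "env", "remove", "--name", env_name, "--yes"]


def remove_envs(env_names, prefix, conda_bin="conda",
                run=subprocess.run, workers=4):
    """Remove all matching environments, several at a time.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without touching a real conda installation.
    """
    targets = select_envs(env_names, prefix)
    commands = [removal_command(conda_bin, name) for name in targets]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and surfaces any exception from `run`
        list(pool.map(run, commands))
    return targets
```

Threads are sufficient here because each worker only waits on an external `conda` process. Passing `run=print` gives a dry run that shows the commands without executing them.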