## ℹ️ About
This repository contains the code for constructing the SweLoc dataset.

## 🚀 Quick Start

Install the required dependencies and add `src` to `PYTHONPATH`:
```bash
pip install -r requirements.txt
export PYTHONPATH="$(pwd)/src"
```

## Dataset Construction

### Scraping PRs

To scrape PRs, run:
```bash
cd src/collect
bash pr_scraping.sh
```
Note that you need to set your GitHub token in `pr_scraping.sh` to scrape PRs from GitHub repositories:
```bash
git_token="Your GitHub token here"
```

### Negative Mining

To extract functions and mine negative code samples for the relevant PRs, run:

```bash
cd ..
bash negative_mining.sh
```
The script above also applies quality filtering and saves the dataset to `repo_contrastive_mined_filtered.jsonl`. Make sure that `PR_DATASET` is set to the output file of the PR scraping step.
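
Before moving on, it can help to sanity-check the mined file. The snippet below is a minimal sketch, not part of this repo: `summarize_jsonl` is a hypothetical helper, and it assumes only that the output is standard JSON Lines (one JSON object per line). It makes no assumption about which fields the records contain; it simply reports the ones it finds.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, Counter of top-level field names) for a JSONL file.

    Hypothetical helper for inspection only; not part of the SweLoc scripts.
    """
    n_records = 0
    field_counts = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # Point at the offending line so a truncated write is easy to spot.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
            n_records += 1
            if isinstance(record, dict):
                field_counts.update(record.keys())
    return n_records, field_counts
```

For example, `summarize_jsonl("repo_contrastive_mined_filtered.jsonl")` returns the number of records and how often each top-level field appears, which makes an empty or partially written output easy to catch before building reranker data.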

### Reranker Data Construction

To generate reranker training data from the filtered contrastive-mined dataset, run:
```bash
bash get_reranker_data.sh
```
