# EditPackFT

This directory contains the code for reproducing EditPackFT from CommitPackFT.
The `filter.py` script filters and cleans the code from CommitPackFT,
and the `format.py` script adds the training prompt to each example.

## Near-Deduplication

An additional step removes near-duplicate examples from the dataset.
Once the dataset is built and formatted, we use the [text-dedup](https://github.com/ChenghaoMou/text-dedup)
framework to remove near-duplicate examples; this deduplication step uses
MinHash and Locality Sensitive Hashing (LSH) to identify near-duplicate examples and
discard them from the dataset. We use a threshold of 0.5 for the Jaccard similarity
to determine whether two examples are near-duplicates.
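
For intuition, the idea behind this step can be sketched in plain Python. This is a minimal illustration, not the text-dedup implementation: the function names, the 3-word shingles, the 128 hash functions, the 32 LSH bands, and the final exact-Jaccard check are all assumptions chosen for the example. Only the 0.5 threshold comes from the setup described above.

```python
import hashlib
import re
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (n=3 is an illustrative choice)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash(shingle_set, num_perm=128):
    """MinHash signature: for each salted hash, the minimum value over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return tuple(signature)


def near_dedup(texts, threshold=0.5, num_perm=128, bands=32):
    """Keep the first example of each near-duplicate group (hypothetical helper)."""
    rows = num_perm // bands
    buckets = defaultdict(list)  # (band index, band values) -> indices of kept examples
    kept_shingles = {}           # index of kept example -> its shingle set
    for idx, text in enumerate(texts):
        sh = shingles(text)
        sig = minhash(sh, num_perm)
        # LSH banding: two signatures collide if they agree on every row of any band.
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        candidates = {j for key in keys for j in buckets[key]}
        # Assumption: confirm LSH candidates with the exact Jaccard similarity.
        if any(jaccard(sh, kept_shingles[j]) >= threshold for j in candidates):
            continue
        kept_shingles[idx] = sh
        for key in keys:
            buckets[key].append(idx)
    return [texts[i] for i in kept_shingles]
```

Calling `near_dedup(examples)` on a list of strings returns the list with later near-duplicates dropped. Banding keeps the number of pairwise comparisons small, since only examples that share at least one band are compared directly.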
