# A Step-by-Step Guide for Replicating Our Finetuning Data Processing/Generation Procedure



## Preparing the baseline dataset
```python
# Imports
import os

from datasets import load_dataset
import pandas as pd

# Load source datasets
ds1 = load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k")['train'].to_pandas()

# Keep only the Python examples from Magicoder
ds2 = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K")['train'].to_pandas()
ds2 = ds2[ds2['lang'] == 'python']

# Merge data (ds1 rows first, then ds2 rows)
merged_instructions = ds1['instruction'].tolist() + ds2['problem'].tolist()
merged_responses = ds1['response'].tolist() + ds2['solution'].tolist()

# Strip chain-of-thought prefix and/or suffix (if any) from responses
def _strip_chain_of_thought(response):
    fence = "```python"
    start = response.find(fence)
    if start == -1:
        # No Python code fence: keep the response, minus surrounding whitespace
        return response.strip()
    start += len(fence)
    end = response.rfind("```")
    if end < start:
        # Opening fence without a closing fence: keep everything after it
        end = len(response)
    return response[start:end].strip()

# Process merged data
merged_ds_python_only = pd.DataFrame({
    'instruction': merged_instructions,
    'response': [_strip_chain_of_thought(response) for response in merged_responses],
})

# Save to a JSON Lines file, creating the output directory if needed
os.makedirs('instruct_data', exist_ok=True)
merged_ds_python_only.to_json('instruct_data/merged_oss_data_raw_pyt.jsonl', lines=True, orient='records')
```
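
Before generating edit sequences, it can help to confirm that the saved file has the expected shape. The following is a minimal sketch and not part of the original pipeline: `check_instruct_jsonl` is a hypothetical helper name, and the checks it runs (both fields present, no empty responses, no leftover code fences) are assumptions about what a well-formed file looks like.

```python
import json


def check_instruct_jsonl(path):
    """Return basic sanity counts for an instruction/response JSON Lines file.

    Hypothetical helper, not part of the original scripts.
    """
    stats = {
        "num_rows": 0,
        "num_empty_responses": 0,
        "num_responses_with_fences": 0,
    }
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)

            # The preparation step writes these two fields for every record
            missing = {"instruction", "response"} - set(record)
            if missing:
                raise ValueError(
                    f"line {line_number}: missing fields {sorted(missing)}"
                )

            response = record["response"]
            stats["num_rows"] += 1
            if not response.strip():
                stats["num_empty_responses"] += 1
            if "```" in response:
                stats["num_responses_with_fences"] += 1
    return stats
```

Calling `check_instruct_jsonl('instruct_data/merged_oss_data_raw_pyt.jsonl')` returns a dictionary of counts. Nonzero values for the last two entries point to responses that may deserve a manual look before they are used as source programs.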

## Generating LintSeq synthetic edit sequences (s = 5)
```bash
python src/data/generate.py --seed 0 --source instruct_data/merged_oss_data_raw_pyt.jsonl --num_workers 24 --num_edit_paths_per_sample 5
```


## Generating linter-ablated, randomly sampled data (s = 5)
```bash
python src/data/ablate.py --seed 0 --source instruct_data/merged_oss_data_raw_pyt.jsonl --num_workers 24 --num_edit_paths_per_sample 5
```
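
The two generation commands differ only in the script they call, so they can be driven from Python with shared settings. This is a rough sketch, not part of the original scripts: `SCRIPTS`, `build_command`, and `run_all` are hypothetical names, and only the flags that appear in the commands above are used. Reading `s` in the section headings as the value of `--num_edit_paths_per_sample` is an assumption based on the matching value of 5.

```python
import subprocess
import sys

# Script paths as given in the commands above
SCRIPTS = {
    "lintseq": "src/data/generate.py",
    "ablated": "src/data/ablate.py",
}


def build_command(variant, source, s=5, seed=0, num_workers=24):
    """Build the argument list for one data-generation run (hypothetical helper)."""
    return [
        sys.executable, SCRIPTS[variant],
        "--seed", str(seed),
        "--source", source,
        "--num_workers", str(num_workers),
        "--num_edit_paths_per_sample", str(s),
    ]


def run_all(source):
    """Run both variants in sequence, stopping at the first failure."""
    for variant in SCRIPTS:
        subprocess.run(build_command(variant, source), check=True)
```

Running `run_all('instruct_data/merged_oss_data_raw_pyt.jsonl')` from the repository root would launch both scripts with the same seed, worker count, and number of edit paths per sample.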