## Files Overview

### Data Files

- **`TEST_raw_ioformat.json`**: Standard input-output format data. This is the intermediate format after preprocessing raw code review feedback data.

- **`TEST_raw_prompt=input=code_output=cls_context=none_v0910_test_sharegptformat.json`**: Processed data ready for LLM training. This is the final format that can be directly fed to language models for training.

### Scripts

- **`preprocess.py`**: Converts raw CodeReviewFeedback data into standard IO format. This script:
  - Loads raw JSON data from a directory
  - Analyzes and filters the data
  - Maps bug types to standard categories
  - Extracts code diffs and repair information
  - Outputs standardized IO format JSON files

- **`export_sharegpt.py`**: Converts standard IO format data into ShareGPT format for LLM training. This script:
  - Reads IO format JSON files
  - Applies prompt templates using Jinja2
  - Formats data as user-assistant conversations
  - Generates ShareGPT-compatible training data

## Data Processing Flow

```
Raw CodeReviewFeedback Data
    ↓
[preprocess.py]
    ↓
Standard IO Format (TEST_raw_ioformat.json)
    ↓
[export_sharegpt.py]
    ↓
LLM Training Format (TEST_raw_prompt=..._sharegptformat.json)
```

