# Data Processing

This directory contains code for dataset preparation and processing. Custom dataset logic for experiments and training tasks belongs here.

## Current Implementation

The current implementation focuses on the [BC-MBPP](https://huggingface.co/datasets/gabeorlanski/bc-mbpp) dataset.

### Files

- `__init__.py`: Package initialization file
- `bc_mbpp_utils.py`: Contains the `BCMBPPUtils` class with utility functions for processing the BC-MBPP dataset
- `prepare_bc_mbpp_dataset.py`: Contains functions for preparing and loading the BC-MBPP dataset:
  - `prepare_dataset()`: Loads and filters the dataset
  - `mock_prediction()`: Creates mock predictions using ground truth solutions
  - `get_mbpp_q_data_dict()`: Prepares the dataset with additional fields
  - `get_mbpp_q_data()`: Gets a specific split from the prepared dataset

## Implementing Custom Dataset Logic

To implement custom dataset logic for your own data:

1. Create a new Python file for dataset preparation (e.g., `prepare_custom_dataset.py`)
2. Create a utils class to store processing logic and constants (e.g., `CustomDatasetUtils`)
3. Implement functions for loading, processing, and preparing your dataset
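
The three steps above can be sketched as follows. This is a minimal illustration, not the layout used by the BC-MBPP code: the field names (`prompt`, `solution`), the length threshold, the `id` format, and the function names `prepare_custom_dataset` and `get_custom_data` are all assumptions chosen for the example. It works on plain in-memory records so that it runs without any external library.

```python
# prepare_custom_dataset.py (hypothetical file name from step 1)
from typing import Dict, List

Record = Dict[str, str]


class CustomDatasetUtils:
    """Hypothetical utils class holding processing logic and constants (step 2)."""

    REQUIRED_FIELDS = ("prompt", "solution")  # assumed field names
    MAX_PROMPT_CHARS = 2000                   # assumed filtering threshold

    @classmethod
    def is_valid(cls, record: Record) -> bool:
        # Keep only records where every required field is present and non-empty
        # and the prompt does not exceed the length limit.
        has_fields = all(record.get(field) for field in cls.REQUIRED_FIELDS)
        return has_fields and len(record["prompt"]) <= cls.MAX_PROMPT_CHARS

    @staticmethod
    def add_fields(record: Record, index: int) -> Record:
        # Return a copy with a stripped prompt and a synthetic id.
        return {**record, "id": f"custom-{index}", "prompt": record["prompt"].strip()}


def prepare_custom_dataset(raw: Dict[str, List[Record]]) -> Dict[str, List[Record]]:
    """Filter each split, then add the extra fields (step 3)."""
    prepared = {}
    for split, records in raw.items():
        kept = [r for r in records if CustomDatasetUtils.is_valid(r)]
        prepared[split] = [CustomDatasetUtils.add_fields(r, i) for i, r in enumerate(kept)]
    return prepared


def get_custom_data(raw: Dict[str, List[Record]], split: str) -> List[Record]:
    """Return one split of the prepared dataset."""
    return prepare_custom_dataset(raw)[split]
```

Keeping constants and per-record logic in the utils class leaves the preparation functions short, which mirrors the split between `bc_mbpp_utils.py` and `prepare_bc_mbpp_dataset.py` described above.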

## Integration

To use your custom dataset in the [`main.py`](../main.py) script:

1. Add your dataset-loading logic to the `load_and_prepare_dataset` function.
2. Update the `Dataset settings` section in the config file.
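
One way to wire a new dataset into `load_and_prepare_dataset` is to dispatch on a dataset name read from the config. The sketch below is an assumption about how that could look, not the actual code in `main.py`: the real function signature, the config keys (`dataset_name` and `dataset_split` here), and both loader functions are hypothetical placeholders.

```python
from typing import Any, Callable, Dict, List

Records = List[Dict[str, Any]]


def load_bc_mbpp(split: str) -> Records:
    # Placeholder standing in for the existing BC-MBPP loading path.
    return [{"source": "bc_mbpp", "split": split}]


def load_custom(split: str) -> Records:
    # Placeholder standing in for your own loader.
    return [{"source": "custom", "split": split}]


# Registry mapping a config value to a loader; adding a dataset is one new entry.
DATASET_LOADERS: Dict[str, Callable[[str], Records]] = {
    "bc_mbpp": load_bc_mbpp,
    "custom": load_custom,
}


def load_and_prepare_dataset(config: Dict[str, Any]) -> Records:
    """Pick a loader based on the (assumed) dataset settings in the config."""
    name = config["dataset_name"]
    split = config.get("dataset_split", "train")
    try:
        loader = DATASET_LOADERS[name]
    except KeyError:
        known = sorted(DATASET_LOADERS)
        raise ValueError(f"Unknown dataset {name!r}; expected one of {known}") from None
    return loader(split)
```

With a registry like this, an unknown dataset name in the config fails early with a message listing the valid options.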
