# Data Pre-processing

The preprocessing code is in:
```sh 
examples/data_preprocess/lean.ipynb 
```
All processed data use the `chat_template` format:

```py
data = {
    "data_source": "mff-lwb-goedel-28k",
    "prompt": [{
        "role": "user",
        "content": prompt
    }],
    "ability": "math",
    "reward_model": {
        "style": "rule",
        "ground_truth": ""
    },
    "extra_info": {
        "split": "train",
        "index": idx
    }
}
```
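As a minimal sketch of how one such record might be assembled, the helper below wraps a prompt string in the structure shown above. The function name `make_record`, its parameters, and the placeholder prompts are hypothetical and not taken from `lean.ipynb`; only the field names and values come from the format itself.

```python
from typing import Any


def make_record(prompt: str, idx: int, split: str = "train",
                data_source: str = "mff-lwb-goedel-28k") -> dict[str, Any]:
    """Wrap one prompt string in the record format (hypothetical helper)."""
    return {
        "data_source": data_source,
        # A single user turn carrying the full prompt text.
        "prompt": [{"role": "user", "content": prompt}],
        "ability": "math",
        "reward_model": {"style": "rule", "ground_truth": ""},
        "extra_info": {"split": split, "index": idx},
    }


# Placeholder prompts for illustration; real prompts come from the notebook.
prompts = ["<prompt for problem 0>", "<prompt for problem 1>"]
records = [make_record(p, i) for i, p in enumerate(prompts)]
```

Keeping the record construction in one function means the index and split are set in a single place, which makes it harder for the two to drift apart across datasets.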

### DeepSeek-Prover-V1.5
This variant is preprocessed under **data/processed_cot/** with the chain-of-thought (CoT) version of the prompt:

```py
    f"""
    Complete the following Lean 4 code with explanatory comments preceding each line of code:
    ```lean4
    import Mathlib
    import Aesop
    set_option maxHeartbeats 0
    open BigOperators Real Nat Topology Rat
    {problem}
    """
```
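The CoT template above could be filled in along these lines. This is a sketch under assumptions: `build_cot_prompt`, `LEAN_HEADER`, and `FENCE` are hypothetical names, and the sketch assumes the prompt lines carry no leading indentation, which the template as printed does not settle. The Lean fence is left open on purpose, as in the template, so that the model continues the code.

```python
# Built programmatically so the example nests safely inside Markdown.
FENCE = "`" * 3

LEAN_HEADER = "\n".join([
    "import Mathlib",
    "import Aesop",
    "set_option maxHeartbeats 0",
    "open BigOperators Real Nat Topology Rat",
])


def build_cot_prompt(problem: str) -> str:
    """Return the CoT prompt for one Lean 4 problem (hypothetical helper)."""
    return (
        "Complete the following Lean 4 code with explanatory comments "
        "preceding each line of code:\n"
        f"{FENCE}lean4\n"
        f"{LEAN_HEADER}\n"
        f"{problem}\n"
    )


# Toy theorem statement used only for illustration.
cot_prompt = build_cot_prompt("theorem toy : 1 + 1 = 2 := by")
```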


### DeepSeek-Prover-V2
This variant is preprocessed under **data/processed/** with the following prompt:
```py
    f"""
    Complete the following Lean 4 code:

    ```lean4
    import Mathlib
    import Aesop
    set_option maxHeartbeats 0
    open BigOperators Real Nat Topology Rat
    {problem}
    ```
    Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies.
    The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
    """

```
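A comparable sketch for the plan-then-prove prompt follows. The names `build_plan_prompt`, `LEAN_HEADER`, `PLAN_INSTRUCTIONS`, and `FENCE` are hypothetical, and the sketch assumes the Lean block is closed with a full triple-backtick fence before the plan instructions and that lines carry no leading indentation.

```python
# Built programmatically so the example nests safely inside Markdown.
FENCE = "`" * 3

LEAN_HEADER = "\n".join([
    "import Mathlib",
    "import Aesop",
    "set_option maxHeartbeats 0",
    "open BigOperators Real Nat Topology Rat",
])

PLAN_INSTRUCTIONS = (
    "Before producing the Lean 4 code to formally prove the given theorem, "
    "provide a detailed proof plan outlining the main proof steps and strategies.\n"
    "The plan should highlight key ideas, intermediate lemmas, and proof "
    "structures that will guide the construction of the final formal proof."
)


def build_plan_prompt(problem: str) -> str:
    """Return the plan-first prompt for one Lean 4 problem (hypothetical helper)."""
    return (
        "Complete the following Lean 4 code:\n\n"
        f"{FENCE}lean4\n"
        f"{LEAN_HEADER}\n"
        f"{problem}\n"
        f"{FENCE}\n"  # unlike the CoT prompt, the Lean block is closed here
        f"{PLAN_INSTRUCTIONS}\n"
    )


# Toy theorem statement used only for illustration.
plan_prompt = build_plan_prompt("theorem toy : 1 + 1 = 2 := by sorry")
```

The only structural difference from the CoT prompt is the closed Lean fence followed by the plan instructions, so the model is asked to reason first and write the proof afterwards.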


