# ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

This repository contains the code, the data, and an example of the ConsistencyCheck benchmark for reproducing the results of our paper.

### Step1: Experiment Environment
To ensure reproducibility, we build the training environment on the Docker image of the open-source Slime framework.

```bash
docker pull slimerl/slime:latest # This is an open-source environment and does not violate anonymity.
# install additional packages
cd ./slime
pip install -e .

pip install json5
```


### Step2: Download LLMs from Hugging Face
Please download the following models from Hugging Face:
```
Qwen3-8B
Qwen3-32B
Qwen3-235B-A22B-Thinking
CriticLean-14B
```

### Step3: Deploy CriticLean-14B and Qwen3-235B-A22B with SGLang
```bash
# set your model path before launching the server
cd ./launch_llm
bash launch_sglang.sh
```

### Step4: Convert models from Hugging Face to Megatron format
```bash
cd ./convert_hf_to_megatron
bash hf2mcore.sh
```

### Step5: SFT
```bash
# set parameters in qwen3-8B.sh before training
cd ./sft_scripts
bash qwen3-8B.sh
```


### Step6: RL

```bash
# set parameters in qwen3-8B-PBSO.sh before training
cd ./rl_scripts
bash qwen3-8B-PBSO.sh
```

## ConsistencyCheck Benchmark

We provide our proposed ConsistencyCheck benchmark in `./benchmark`. Each entry has the following fields:

*   **`informal_statement`**: This field contains the natural language description of the mathematical problem, as provided in the original dataset.
*   **`formal_statement`**: This field contains the corresponding formal statement written in Lean 4, which serves as the ground-truth formalization from the original dataset.
*   **`header`**: A necessary block of code, including `import` statements and other setup commands, required for the `formal_statement` to be successfully processed by the Lean 4 compiler.
*   **`human_check`**: A human-annotated label indicating the result of our consistency validation. It typically takes a binary value (e.g., "Correct" or "Incorrect") to specify whether the generated formalization is semantically faithful to the `informal_statement`.
*   **`human_reason`**: A detailed, human-written annotation that explains the reasoning behind the `human_check` label. If a formalization is deemed incorrect, this field pinpoints the specific location of the error and describes the nature of the discrepancy.
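
As a rough illustration of how these fields fit together, the sketch below checks that a record carries all five documented fields and joins `header` with `formal_statement` into a single Lean 4 source string. The helper names `missing_fields` and `to_lean_source` are hypothetical and not part of this repository, and the sample record is made up for illustration. The on-disk format of the files in `./benchmark` is not assumed here; the functions only operate on an already-loaded Python dict.

```python
# Field names are taken from the list above; everything else is illustrative.
REQUIRED_FIELDS = (
    "informal_statement",
    "formal_statement",
    "header",
    "human_check",
    "human_reason",
)


def missing_fields(record: dict) -> list[str]:
    """Return the documented fields that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def to_lean_source(record: dict) -> str:
    """Join the header and the formal statement into one Lean 4 source string."""
    missing = missing_fields(record)
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    header = record["header"].rstrip()
    statement = record["formal_statement"].strip()
    # One blank line separates the imports from the statement.
    return f"{header}\n\n{statement}\n"


# Made-up record for illustration; it is NOT taken from the benchmark.
example = {
    "informal_statement": "Prove that 1 + 1 = 2.",
    "formal_statement": "theorem one_add_one_eq_two : (1 : ℕ) + 1 = 2 := by sorry",
    "header": "import Mathlib\n",
    "human_check": "Correct",
    "human_reason": "The statement matches the informal problem.",
}

source = to_lean_source(example)
print(source)
```

The resulting string can be written to a `.lean` file and passed to a Lean 4 toolchain of your choice; how the compiler is invoked depends on your local setup and is not covered here.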


## Others
* We provide sampled subsets of our SFT and RL data in `./data`.
* We will release our trained checkpoint after the review process.
* Special thanks to Slime, SGLang, and vLLM for their valuable work.


## Case Study: Trajectory

To support further case studies, we provide trajectories randomly sampled from ReForm-8B in `./case_study/traj_case_study.jsonl`.
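
Since the file uses the `.jsonl` extension, a minimal way to take a first look at it is sketched below, assuming each non-empty line holds one JSON object. The helpers `read_jsonl`, `key_counts`, and `summarize` are hypothetical names introduced for illustration. No particular field names are assumed; the sketch only reports which top-level keys occur and how often.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
    return records


def key_counts(records):
    """Count how often each top-level key appears across dict records."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


def summarize(path):
    """Return a short text summary: record count, then one line per key."""
    records = read_jsonl(path)
    lines = [f"{len(records)} records"]
    for key, count in key_counts(records).most_common():
        lines.append(f"{key}: {count}")
    return "\n".join(lines)
```

From the repository root, `print(summarize("./case_study/traj_case_study.jsonl"))` would list the number of sampled trajectories and the keys they share, which is a reasonable starting point before reading individual trajectories.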

With the above information, we believe you can easily reproduce our work.