This README describes the code provided in our supplementary material.

- The folder ``benchmark_construction`` contains all code for constructing CRITIC-math.
- The folder ``Task_1`` contains code for Research Question 1, i.e., evaluating LRMs.
  - Coarse-grained evaluation code is in the folder ``Task_1``.
  - Fine-grained evaluation code is in the folder ``Task_1_analysis``.
- The folder ``Task_3`` contains code for Research Question 2, i.e., evaluating the effectiveness of SFT.

To run our code, you need to configure the API in the folder ``common``, specifically in the file ``common/model_configs.py``.
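
As a rough illustration of what such an API configuration might look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the dictionary ``MODEL_CONFIGS``, the helper ``get_model_config``, the key names, the environment variables, and the ``example-model`` entry are hypothetical, and the actual contents of ``common/model_configs.py`` may be organized differently.

```python
import os

# Hypothetical layout: the real common/model_configs.py may use other names.
# Credentials are read from environment variables so they stay out of the repo.
MODEL_CONFIGS = {
    "example-model": {
        "api_base": os.environ.get("EXAMPLE_API_BASE", "https://api.example.com/v1"),
        "api_key": os.environ.get("EXAMPLE_API_KEY", ""),
        "model_name": "example-model",
    },
}


def get_model_config(name):
    """Return the config dict for `name`, failing early if it is not defined."""
    if name not in MODEL_CONFIGS:
        raise KeyError(f"Unknown model {name!r}; known models: {sorted(MODEL_CONFIGS)}")
    return MODEL_CONFIGS[name]
```

Reading the key from the environment is only one possible design; it avoids committing credentials alongside the code.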

We provide the shell scripts required to run our code:

- ``shells/full_1.sh`` contains commands to construct the test set of CRITIC-math.
- ``shells/full_2.sh`` contains commands to construct the training set of CRITIC-math.
- ``shells/math_task_1_new`` contains commands to evaluate LRMs. In addition, ``math_task_1_new_m`` contains the commands to sample multiple runs.
- ``shells/math_task_1_analysis`` contains commands for the fine-grained evaluation of LRMs' thoughts.
- ``shells/math_task_3_sft`` contains commands to run SFT on the data and to evaluate.
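
The scripts listed above could be driven from a small Python helper such as the sketch below. It assumes that each script is run with ``bash`` from the repository root, which the README does not state. The stage labels, ``STAGE_SCRIPTS``, ``build_command``, and ``run_stage`` are hypothetical names introduced here; only the script paths are taken from the list above.

```python
import subprocess

# Stage labels are hypothetical; the paths are copied verbatim from the README.
STAGE_SCRIPTS = {
    "build_test_set": "shells/full_1.sh",
    "build_train_set": "shells/full_2.sh",
    "evaluate_lrms": "shells/math_task_1_new",
    "analyze_thoughts": "shells/math_task_1_analysis",
    "sft_and_evaluate": "shells/math_task_3_sft",
}


def build_command(stage):
    """Return the argv list for a stage (assumes the scripts are run with bash)."""
    if stage not in STAGE_SCRIPTS:
        raise KeyError(f"Unknown stage {stage!r}; known stages: {sorted(STAGE_SCRIPTS)}")
    return ["bash", STAGE_SCRIPTS[stage]]


def run_stage(stage, dry_run=True):
    """Print the command when dry_run is True; otherwise execute it."""
    command = build_command(stage)
    if dry_run:
        print(" ".join(command))
        return 0
    # check=True raises CalledProcessError if the script exits with a non-zero code.
    return subprocess.run(command, check=True).returncode
```

Calling ``run_stage("build_test_set")`` only prints the command; pass ``dry_run=False`` to actually execute the script.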