# Introduction to SuperCorrect-Qwen-Eval

## Brief Introduction

The JSON file contains the detailed evaluation results of our model SuperCorrect-Qwen. The results come from the MATH evaluation task, which consists of 5000 samples from the test set of the MATH dataset.

Below is a brief introduction to the features of our evaluation results.

## Evaluation Result Features

Each sample has 7 features that present our results in detail. Each feature is described below.

1. **'instruction':** Our Hierarchical Thought Template Reasoning Prompt (HT), as shown in the appendix of our paper.
2. **'input':** The problem from the MATH dataset.
3. **'output':** HT + Problem + Response from SuperCorrect-Qwen.
4. **'xml_solution':** The solution extracted from the response, provided for better presentation.
5. **'response_answer':** The answer extracted from the response.
6. **'correct_answer':** The ground-truth answer.
7. **'correct':** The final judgment on whether 'response_answer' aligns with 'correct_answer'.
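
As a rough illustration of the record layout described above, the following Python sketch checks that a record carries all seven fields. It assumes the results file is a JSON array of objects, which is not confirmed here; the helper names `missing_fields` and `load_records` are hypothetical, and the sample record uses made-up placeholder values rather than real data.

```python
import json

# The seven field names listed above.
EXPECTED_FIELDS = [
    "instruction",
    "input",
    "output",
    "xml_solution",
    "response_answer",
    "correct_answer",
    "correct",
]


def missing_fields(record):
    """Return the expected field names that are absent from one record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def load_records(path):
    """Load the results file, ASSUMING it is a JSON array of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up placeholder record, not taken from the real results file.
# The type of 'correct' (boolean here) is also an assumption.
sample = {
    "instruction": "<HT prompt>",
    "input": "<MATH problem>",
    "output": "<HT + problem + response>",
    "xml_solution": "<extracted solution>",
    "response_answer": "<extracted answer>",
    "correct_answer": "<ground truth>",
    "correct": True,
}

print(missing_fields(sample))
```

Calling `load_records` with the path to the results file and running `missing_fields` over each record would flag any entries that deviate from this layout.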

Note that the MATH dataset contains various types of problems, so final answers with the same meaning may appear in different forms. To judge the correctness of 'response_answer' reliably, we use GPT-4o-mini as an inspector to decide whether 'response_answer' is correct compared to 'correct_answer'.

## Evaluation Results 

- Number of correct samples: 3514
- Total samples: 5000
- Accuracy: 70.28% 
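
The accuracy figure follows directly from the two counts above. The Python sketch below reproduces the arithmetic; `accuracy_percent` and `count_correct` are hypothetical helper names, and `count_correct` assumes the 'correct' field is stored as a JSON boolean, which may not match the actual file.

```python
def accuracy_percent(num_correct, total):
    """Accuracy as a percentage of correctly answered samples."""
    return 100.0 * num_correct / total


def count_correct(records):
    """Count records judged correct.

    ASSUMPTION: 'correct' is a boolean. If the file stores it as a
    string instead, this comparison would need to be adapted.
    """
    return sum(1 for record in records if record["correct"] is True)


# Counts reported above: 3514 correct out of 5000 samples.
print(f"{accuracy_percent(3514, 5000):.2f}%")
```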



