# GAOKAO-Eval

## Introduction

The Gaokao, one of China's most authoritative examinations, encompasses a wide range of subjects and question types, aiming to comprehensively assess the abilities of examinees. As such, it serves as an excellent benchmark for evaluating large models. We have selected open-source models released before June 6, 2024, and the current state-of-the-art model, GPT-4o, for a thorough evaluation using the 2024 Gaokao exam papers. Unlike previous evaluations that focused solely on objective questions, this assessment includes various types of Gaokao questions, such as multiple-choice, problem-solving, reading comprehension, and essays. All subjective questions were graded by active high school teachers to provide a comprehensive evaluation of the current large models' capabilities.

GAOKAO-Eval is characterized by the following four features:

- **Comprehensive Examination**: The evaluation covers the entire exam, not just specific question types, including Gaokao questions with images.
- **Pre-Exam Open Source**: The evaluation includes only models that were open-sourced before the Gaokao exam, eliminating the possibility of leaked questions.
- **Teacher Grading**: Experienced Gaokao examiners were invited to grade the answers, ensuring consistency with the official grading standards.
- **Fully Transparent**: The code for generating answers, model responses, and grading results are fully open-sourced.

> **As with the Gaokao scores, this evaluation cannot achieve absolute fairness. The scores are merely a reference. To ensure objectivity, each question was graded by at least three teachers, and discrepancies were recalibrated.**

> **It is important to note that large models make mistakes differently from human examinees. Teachers may not be fully accustomed to grading large models, leading to potential misjudgments.**

> **Additionally, we observed significant score variability across different Gaokao papers for large models, resulting in noticeable variations in scores or rankings across different provinces and cities.**

> **Note that this evaluation only assesses large language models' performance on Gaokao questions and does not comprehensively evaluate the models' capabilities. Therefore, the ranking based on Gaokao scores does not reflect the quality of the model's user experience or overall ability.**

## Recent Developments

- **[2024.07.17]** Completed evaluations of six open-source models on eight subjects of the National Paper A, excluding politics. Click [National Paper A Results](./results/全国甲卷/README.md) for details.
- **[2024.07.17]** Completed evaluations of six open-source models on six subjects of the New Curriculum Standard Paper. Click [New Curriculum Standard Paper Results](./results/新课标/README.md) for details; corrected the evaluation of question 10 in the New Curriculum Standard I Mathematics Paper; added Gradio invocation script.
- **[2024.06.15]** Completed evaluations of six open-source models on three subjects of the New Curriculum Standard I Paper (Chinese, Mathematics, and English). Click [New Curriculum Standard Paper Results](./results/新课标/README.md) for details.

## Exam Types

Following the Gaokao reform, six types of national exam papers were available in 2024. The Beijing, Shanghai, Tianjin, and National Paper A papers cover all subjects, while provinces using the New Curriculum Standard I and II Papers take only the Chinese, Mathematics, and foreign-language exams from those papers; most of these provinces set the remaining subjects independently. In GAOKAO-Eval, we tested all publicly available papers of the New Curriculum Standard and National Paper A.

| Exam Type                  | Provinces/Cities                                             |
| -------------------------- | ------------------------------------------------------------ |
| New Curriculum Standard I  | Guangdong, Fujian, Hubei, Hunan, Jiangsu, Hebei, Shandong, Zhejiang, Jiangxi, Anhui, Henan |
| New Curriculum Standard II | Liaoning, Chongqing, Hainan, Shanxi, Xinjiang, Guangxi, Guizhou, Heilongjiang, Gansu, Jilin, Yunnan, Tibet |
| New Curriculum Standard    | Shanxi, Henan, Yunnan, Tibet, Xinjiang                       |
| National Paper A           | Sichuan, Inner Mongolia, Ningxia, Shaanxi, Qinghai           |
| Beijing Paper              | Beijing                                                      |
| Shanghai Paper             | Shanghai                                                     |
| Tianjin Paper              | Tianjin                                                      |

The current Gaokao system falls into three major categories:

- **"3+1+2" New Model**: Adopted by 23 provinces, this model revolves around three core subjects: Chinese, Mathematics, and Foreign Language. Students choose one primary subject from Physics or History and select two from the remaining four subjects (Political Science, Geography, Chemistry, Biology).
- **"3+3" Model**: Used by six provinces. Students complete the three core subjects and freely choose three electives from six subjects (Physics, Chemistry, Biology, Political Science, History, and Geography; Zhejiang adds a technical subject).
- Five provinces still use the **National Paper A** with a traditional division of arts and sciences.

## Score Summary

### 1. New Curriculum Standard Paper

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th colspan="13" style="text-align: center;">New Curriculum Standard&dagger; (sorted by total score of science)</th>
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
      <td>Model</td>
      <td>Research Institution</td>
      <td>Chinese</td>
      <td>Mathematics</td>
      <td>English</td>
      <td>Physics</td>
      <td>Chemistry</td>
      <td>Biology</td>
      <td>History</td>
      <td>Geography</td>
      <td>Political Science</td>
      <td>Total Science Score</td>
      <td>Total Humanities Score</td>
    </tr>
    <tr style="text-align: center;">
      <td>WQX+VL-20B</td>
      <td>Ours</td>
      <td>112</td>
      <td>74</td>
      <td>138.5</td>
      <td>39</td>
      <td>48</td>
      <td>57</td>
      <td>82</td>
      <td>58</td>
      <td>67</td>
      <td>468.5</td>
      <td>531.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>GPT-4o</td>
      <td>OpenAI</td>
      <td>111.5</td>
      <td>73</td>
      <td>141.5</td>
      <td>36</td>
      <td>40</td>
      <td>65</td>
      <td>88</td>
      <td>59</td>
      <td>58</td>
      <td>467</td>
      <td>531</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B text only</td>
      <td>Alibaba</td>
      <td>124</td>
      <td>68</td>
      <td>139</td>
      <td>42</td>
      <td>44</td>
      <td>48</td>
      <td>85</td>
      <td>70</td>
      <td>60</td>
      <td>465</td>
      <td>546</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B+VL-7B</td>
      <td>Alibaba</td>
      <td>124</td>
      <td>68</td>
      <td>139</td>
      <td>19</td>
      <td>6</td>
      <td>48</td>
      <td>85</td>
      <td>4</td>
      <td>60</td>
      <td>404</td>
      <td>480</td>
    </tr>
    <tr style="text-align: center;">
      <td>Yi-34B+VL-34B</td>
      <td>01.AI</td>
      <td>97</td>
      <td>31</td>
      <td>134.5</td>
      <td>21</td>
      <td>37</td>
      <td>49</td>
      <td>48</td>
      <td>41</td>
      <td>51</td>
      <td>369.5</td>
      <td>402.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-57B+VL-7B</td>
      <td>Alibaba</td>
      <td>99.5</td>
      <td>58</td>
      <td>126.5</td>
      <td>7</td>
      <td>6</td>
      <td>51</td>
      <td>73</td>
      <td>4</td>
      <td>62</td>
      <td>348</td>
      <td>423</td>
    </tr>
    <tr style="text-align: center;">
      <td>GLM4-9B+VL-9B</td>
      <td>Zhipu AI</td>
      <td>86</td>
      <td>48</td>
      <td>97</td>
      <td>18</td>
      <td>27</td>
      <td>67</td>
      <td>80</td>
      <td>62</td>
      <td>48</td>
      <td>343</td>
      <td>421</td>
    </tr>
    <tr style="text-align: center;">
      <td>Mixtral 8x22B</td>
      <td>Mistral</td>
      <td>77.5</td>
      <td>21</td>
      <td>116.5</td>
      <td>25</td>
      <td>35</td>
      <td>46</td>
      <td>54</td>
      <td>56</td>
      <td>38</td>
      <td>321</td>
      <td>363</td>
    </tr>
  </tbody>
</table>

&dagger; indicates that the assessment uses the New Curriculum Standard I Paper for Chinese, Mathematics, and English, along with the New Curriculum Standard Paper for Arts and Science Comprehensive Tests.

If the model name includes "+VL," it indicates that questions involving images will be processed using the corresponding multimodal version of the model for inference. If "+VL" is not present, only text-based inference will be performed without considering images.

For more detailed scores and model outputs, please refer to [New Curriculum Standard Paper Results](./results/新课标/README.md).
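The two total columns appear to be plain sums of six subject scores each: the three core subjects plus either the science trio or the humanities trio. The following is a minimal Python sketch that reproduces the totals for two rows of the table above. The function name and the subject groupings are illustrative assumptions inferred from the column layout, not code from this repository.

```python
# Subject groupings inferred from the table columns (an assumption,
# not taken from the repository's own scripts).
CORE = ("Chinese", "Mathematics", "English")
SCIENCE = ("Physics", "Chemistry", "Biology")
HUMANITIES = ("History", "Geography", "Political Science")


def totals(scores: dict[str, float]) -> tuple[float, float]:
    """Return (science total, humanities total) for one model's subject scores."""
    core = sum(scores[s] for s in CORE)
    science = core + sum(scores[s] for s in SCIENCE)
    humanities = core + sum(scores[s] for s in HUMANITIES)
    return science, humanities


# Scores copied from the New Curriculum Standard table above.
wqx = {
    "Chinese": 112, "Mathematics": 74, "English": 138.5,
    "Physics": 39, "Chemistry": 48, "Biology": 57,
    "History": 82, "Geography": 58, "Political Science": 67,
}
gpt4o = {
    "Chinese": 111.5, "Mathematics": 73, "English": 141.5,
    "Physics": 36, "Chemistry": 40, "Biology": 65,
    "History": 88, "Geography": 59, "Political Science": 58,
}

print(totals(wqx))    # (468.5, 531.5)
print(totals(gpt4o))  # (467.0, 531.0)
```

Both results match the "Total Science Score" and "Total Humanities Score" cells of the corresponding rows.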

### 2. National Paper A

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th colspan="13" style="text-align: center;">Score of National Paper A (sorted by total score of science)</th>
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
      <td>Model</td>
      <td>Research Institution</td>
      <td>Chinese</td>
      <td>English</td>
      <td>Mathematics (Science)</td>
      <td>Physics</td>
      <td>Chemistry</td>
      <td>Biology</td>
      <td>Mathematics (Arts)</td>
      <td>History</td>
      <td>Geography</td>
      <td>Total Science Score</td>
      <td>Total Humanities Score (Excluding Politics)</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B text only</td>
      <td>Alibaba</td>
      <td>128</td>
      <td>141</td>
      <td>89</td>
      <td>32</td>
      <td>48</td>
      <td>50</td>
      <td>95</td>
      <td>71</td>
      <td>81</td>
      <td>488</td>
      <td>516</td>
    </tr>
    <tr style="text-align: center;">
      <td>GPT-4o</td>
      <td>OpenAI</td>
      <td>122</td>
      <td>142.5</td>
      <td>84</td>
      <td>31</td>
      <td>34</td>
      <td>72</td>
      <td>89</td>
      <td>82</td>
      <td>66</td>
      <td>485.5</td>
      <td>501.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>WQX+VL-20B</td>
      <td>Ours</td>
      <td>111</td>
      <td>141</td>
      <td>78</td>
      <td>30</td>
      <td>52</td>
      <td>50</td>
      <td>71</td>
      <td>76</td>
      <td>64</td>
      <td>462</td>
      <td>463</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B+VL-7B</td>
      <td>Alibaba</td>
      <td>128</td>
      <td>141</td>
      <td>89</td>
      <td>22</td>
      <td>22</td>
      <td>50</td>
      <td>95</td>
      <td>71</td>
      <td>34</td>
      <td>452</td>
      <td>469</td>
    </tr>
    <tr style="text-align: center;">
      <td>Mixtral 8x22B</td>
      <td>Mistral</td>
      <td>92</td>
      <td>142</td>
      <td>58</td>
      <td>38</td>
      <td>39</td>
      <td>54</td>
      <td>53</td>
      <td>74</td>
      <td>74</td>
      <td>423</td>
      <td>435</td>
    </tr>
    <tr style="text-align: center;">
      <td>GLM4-9B+VL-9B</td>
      <td>Zhipu AI</td>
      <td>108</td>
      <td>110.5</td>
      <td>71</td>
      <td>29</td>
      <td>44</td>
      <td>55</td>
      <td>75</td>
      <td>54</td>
      <td>62</td>
      <td>417.5</td>
      <td>409.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-57B+VL-7B</td>
      <td>Alibaba</td>
      <td>108</td>
      <td>141</td>
      <td>65</td>
      <td>6</td>
      <td>22</td>
      <td>44</td>
      <td>75</td>
      <td>77</td>
      <td>30</td>
      <td>386</td>
      <td>431</td>
    </tr>
    <tr style="text-align: center;">
      <td>Yi-34B+VL-34B</td>
      <td>01.AI</td>
      <td>109</td>
      <td>107.5</td>
      <td>39</td>
      <td>15</td>
      <td>40</td>
      <td>55.5</td>
      <td>65</td>
      <td>53</td>
      <td>54</td>
      <td>366</td>
      <td>388.5</td>
    </tr>
  </tbody>
</table>


If the model name includes "+VL," it indicates that questions involving images will be processed using the corresponding multimodal version of the model for inference. If "+VL" is not present, only text-based inference will be performed without considering images.

For more detailed scores and model outputs, please refer to [National Paper A Results](./results/全国甲卷/README.md).

> Teachers were not informed that the answers were generated by large models before grading.

> Some models may completely misunderstand questions, generate repetitive answers, or provide analyses instead of solutions. Teachers confirmed these issues with us, and we instructed them to consider such errors as incorrect answers.

> Some teachers noted a 1-2 point margin of error in essay grading due to the absence of handwritten answers.

## Model Overview

We evaluated large models from Alibaba, 01.AI, Zhipu AI, Mistral, and OpenAI, along with our own WQX models. Gaokao papers contain many image-based questions. Language models answered only the text-based questions (with a few exceptions), while multimodal models answered all questions. We selected open-source models released before June 6, 2024, and included the most advanced model, GPT-4o, as a reference. The participating models are listed below:

|                     | Research Institution        | Model Type | Model Description                                            | Weight Upload Date | Model Link                                                   |
| ------------------- | --------------------------- | ---------- | ------------------------------------------------------------ | ------------------ | ------------------------------------------------------------ |
| WQX-20B    | Ours | LLM        | Used for GAOKAO-Eval | 2024.06.04         | anonymization |
| WQX-20B-VL | Ours | MLLM       | Used for GAOKAO-Eval | 2024.06.04         | anonymization |
| Qwen2-72B           | Alibaba                     | LLM        | The largest language model in the Qwen2 series released by Alibaba. | 2024.05.28         | [🤗HuggingFace](https://huggingface.co/Qwen/Qwen2-72B-Instruct) |
| Qwen2-57B           | Alibaba                     | LLM        | An MoE language model in the Qwen2 series released by Alibaba. | 2024.05.04         | [🤗HuggingFace](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct) |
| QwenVL-7B           | Alibaba                     | MLLM       | A multimodal language model released by Alibaba.             | 2023.09.25         | [🤗HuggingFace](https://huggingface.co/Qwen/Qwen-VL-Chat)     |
| Yi-1.5-34B          | 01.AI                       | LLM        | The largest language model in the Yi 1.5 series released by 01.AI. | 2024.05.12         | [🤗HuggingFace](https://huggingface.co/01-ai/Yi-1.5-34B-Chat) |
| Yi-VL-34B           | 01.AI                       | MLLM       | A large multimodal language model released by 01.AI.         | 2024.01.19         | [🤗HuggingFace](https://huggingface.co/01-ai/Yi-VL-34B)       |
| GLM4-9B             | Zhipu AI                    | LLM        | The open-source version of the latest generation pre-trained model in the GLM-4 series released by Zhipu AI. | 2024.06.04         | [🤗HuggingFace](https://huggingface.co/THUDM/glm-4-9b-chat)   |
| GLM-4v-9B           | Zhipu AI                    | MLLM       | The multimodal model in the latest generation pre-trained model in the GLM-4 series released by Zhipu AI. | 2024.06.04         | [🤗HuggingFace](https://huggingface.co/THUDM/glm-4-9b-chat)   |
| Mixtral 8x22B       | Mistral                     | LLM        | The most powerful language model currently open-sourced by the French AI startup Mistral. | 2024.04.17         | [🤗HuggingFace](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) |
| GPT-4o              | OpenAI                      | MLLM       | The most powerful large language model released by OpenAI, currently the leading LLM in the world. | 2024.05.13         | [OpenAI](https://openai.com/index/hello-gpt-4o/)             |

## File Structure

The project's file structure is as follows:

```
├── README.md
├── results/
│   ├── README.md
│   ├── New Curriculum/      # A folder for each Gaokao paper type
│   │   ├── README.md  # Summary of scores for each Gaokao paper
│   │   ├── Mathematics/       # Jupyter notebooks displaying model answers
│   │   │   ├── New Curriculum I Mathematics_Mixtral-8x22B-Instruct-v0.1.ipynb
│   │   │   └──...
│   │   ├── English/
│   │   ├── Chinese/
│   │   ├── Chemistry/
│   │   └── ...
│   └── National Paper A/
│       ├── README.md
│       ├── Arts and Sciences Mathematics/
│       │   ├── National Paper A Arts and Sciences Mathematics_Mixtral-8x22B-Instruct-v0.1.ipynb
│       │   └──...
│       └── ...
```
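The notebook names in the tree follow a `<paper and subject>_<model>.ipynb` pattern. The following is a minimal Python sketch for indexing the result notebooks by that pattern. The function names are hypothetical, the split on the first underscore is an assumption that holds only if the paper and subject part never contains an underscore, and the actual folder names in the repository may differ from the English names shown above.

```python
from pathlib import Path


def parse_notebook_name(path: Path) -> tuple[str, str]:
    """Split '<paper and subject>_<model>.ipynb' into (paper_subject, model).

    Assumes the first underscore separates the two parts.
    """
    paper_subject, _, model = path.stem.partition("_")
    return paper_subject, model


def list_notebooks(results_dir: str = "results") -> list[tuple[str, str, str]]:
    """Collect (subject folder, paper and subject, model) for every notebook."""
    rows = []
    for nb in sorted(Path(results_dir).rglob("*.ipynb")):
        paper_subject, model = parse_notebook_name(nb)
        rows.append((nb.parent.name, paper_subject, model))
    return rows


example = Path("New Curriculum I Mathematics_Mixtral-8x22B-Instruct-v0.1.ipynb")
print(parse_notebook_name(example))
# ('New Curriculum I Mathematics', 'Mixtral-8x22B-Instruct-v0.1')
```

Calling `list_notebooks()` from the repository root would then give one row per model answer notebook.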

## Evaluation Method

In this evaluation, images from the Chinese, Mathematics, and English exams were discarded, and only the text was input into the models (in the New Curriculum Standard I exams, only Mathematics included image-based questions, two in total, and the images had minimal impact on understanding and answering them). For the English listening section (worth 30 points), all models were credited with full marks. For the arts and science comprehensive tests, image-based questions were answered using the multimodal versions of the models, while text-only questions were answered by the language models. The parameters, prompts, outputs, and scores for all models are open-sourced in this repository.
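The routing rule described above can be sketched in a few lines of Python. This is an illustration of the stated policy, not the repository's actual inference code: the `Question` class, the `pick_model` and `english_score` functions, and the `"language"` / `"multimodal"` labels are all hypothetical names.

```python
from dataclasses import dataclass

LISTENING_POINTS = 30  # English listening is credited in full for every model
TEXT_ONLY_SUBJECTS = {"Chinese", "Mathematics", "English"}  # images are discarded


@dataclass
class Question:
    subject: str
    text: str
    has_image: bool = False


def pick_model(question: Question, has_vl_model: bool) -> str:
    """Return which model variant answers a question (hypothetical labels)."""
    if question.subject in TEXT_ONLY_SUBJECTS:
        return "language"  # any image is dropped; only the text is used
    if question.has_image and has_vl_model:
        return "multimodal"
    return "language"  # text-only question, or no multimodal version available


def english_score(written_points: float) -> float:
    """Add the assumed full listening score to the graded written score."""
    return written_points + LISTENING_POINTS


print(pick_model(Question("Physics", "...", has_image=True), has_vl_model=True))
# multimodal
print(pick_model(Question("Mathematics", "...", has_image=True), has_vl_model=True))
# language
```

Passing `has_vl_model=False` corresponds to the "text only" entries in the score tables, where the language model also answers image-based questions without seeing the images.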

### Evaluation Method for Multimodal Questions

Since the Mixtral series includes only language models, Mixtral answered the multimodal questions with its language model alone, without access to the images. Because QwenVL-7B performed poorly on the New Curriculum Standard geography exam (scoring only 4 points), we also evaluated the Qwen2-72B text model on the multimodal questions in physics, chemistry, and geography for both the New Curriculum Standard and National Paper A exams.

For details on handling multimodal question images, refer to [Multimodal Question Image Processing](./results/README.md#题目图片).

### Grading Method by Human Teachers

Teachers were not informed that answers were generated by large models before grading. However, because some models completely misunderstood questions, generated repetitive answers, or provided analyses instead of solutions, teachers raised these cases with us during grading, and we instructed them to mark such responses as incorrect.

Additionally, there may be a 1-2 point margin of error in essay grading due to the absence of handwritten answers.

## Acknowledgments

We sincerely thank all the high school teachers who participated in this project. The outputs of large models present various challenges, and the teachers have shown great patience and diligence in grading. We appreciate their efforts immensely.
