# College Entrance Examination Evaluation Results
This folder contains all the inference code, generation results, and detailed scoring information for the models.

## File Structure
The evaluation results are organized as `Exam Name` / `Subject` / `Jupyter Notebook file displaying model output`, as shown below.
```
results/
├────── README.md      # Testing instructions
├────── New Curriculum Paper/       # A folder created for each type of college entrance examination paper
│       ├── README.md   # Summary of scores for the corresponding examination paper
│       ├── Mathematics/       # Generation results from various models
│       │   ├── New Curriculum I Mathematics_Mixtral-8x22B-Instruct-v0.1.ipynb
│       │   ├── New Curriculum I Mathematics_Qwen2-57B-A14B-Instruct.ipynb
│       │   ├── New Curriculum I Mathematics_Qwen2-72B-Instruct.ipynb
│       │   ├── New Curriculum I Mathematics_Yi-1.5-34B-Chat.ipynb
│       │   ├── New Curriculum I Mathematics_glm-4-9b-chat.ipynb
│       │   ├── New Curriculum I Mathematics_gpt-4o.ipynb
│       │   └── New Curriculum I Mathematics_WQX-20b.ipynb
│       ├── English/
│       └── Chinese/
└────── National A Paper/      # A folder created for each type of college entrance examination paper
```
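The notebook file names in the tree above appear to follow the pattern `<paper and subject>_<model name>.ipynb`. A minimal sketch for recovering the model name from a file name, assuming that pattern holds for every notebook and that the part before the first underscore never contains an underscore itself (the helper `parse_notebook_name` is hypothetical and not part of this repository):

```python
from pathlib import Path


def parse_notebook_name(file_name: str) -> tuple[str, str]:
    """Split '<paper and subject>_<model>.ipynb' into (paper_and_subject, model)."""
    stem = Path(file_name).stem  # drops only the trailing '.ipynb'
    # Split on the first underscore only, so model names may contain underscores.
    paper_and_subject, sep, model = stem.partition("_")
    if not sep:
        raise ValueError(f"unexpected notebook name: {file_name!r}")
    return paper_and_subject, model


print(parse_notebook_name("New Curriculum I Mathematics_gpt-4o.ipynb"))
```

Combined with `Path("results").rglob("*.ipynb")`, this could be used to group notebooks by model across papers and subjects.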

## Example Questions
### Mathematical Formulas
Before a question is passed to a large language model, it is converted to plain text. Formulas in mathematics questions are written in LaTeX, as in the question below:

![image](https://github.com/OpenMOSS/CoLLiE/assets/65400838/2ad7393b-a93b-4ebf-a5d6-a81d5f9e4162)

> Note: when preparing multi-part mathematics questions, we omitted the sub-question numbers by mistake. The model therefore sees the content of the image above without sub-question identifiers. During the actual evaluation, however, we found that most models still recognized these as three separate sub-questions.

The question above is converted into the following text, which is the input to the model:
```latex
已知函数 $f(x)=\ln \frac{x}{2-x}+a x+b(x-1)^3$.
若 $b=0$, 且 $f^{\prime}(x) \geqslant 0$, 求 $a$ 的最小值.
证明: 曲线 $y=f(x)$ 是中心对称图形.
若 $f(x)>-2$, 当且仅当 $1<x<2$, 求 $b$ 的取值范围.
```

During inference, `max_new_token` is set to `2048` for every model, and greedy decoding is used for all questions except the Chinese and English essay questions.

### Questions with Images
For multimodal questions, the images are embedded in the question text as HTML `<img>` tags, for example:

In a physical education class, two students are playing badminton indoors. The dashed line in the figure shows the trajectory of the shuttlecock as it rises. Taking air resistance into account, which of the following could be the direction of the shuttlecock's acceleration? ( )
- A: `<img alt="" height="59px" src="data/img/0_0.png" style="vertical-align:middle;" width="149px"/>`
- B: `<img alt="" height="57px" src="data/img/0_1.png" style="vertical-align:middle;" width="130px"/>`
- C: `<img alt="" height="65px" src="data/img/0_2.png" style="vertical-align:middle;" width="144px"/>`
- D: `<img alt="" height="56px" src="data/img/0_3.png" style="vertical-align:middle;" width="140px"/>`

The script extracts these images and arranges them in a single composite image. Each image is labeled `<IMAGE i>` in the composite, and the corresponding `<img>` tag in the question text is replaced with `<IMAGE i>`, as shown below:

In a physical education class, two students are playing badminton indoors. The dashed line in the figure shows the trajectory of the shuttlecock as it rises. Taking air resistance into account, which of the following could be the direction of the shuttlecock's acceleration? ( )

- A: `<IMAGE 0>` 
- B: `<IMAGE 1>` 
- C: `<IMAGE 2>` 
- D: `<IMAGE 3>`

<img src="https://ks-1302698447.cos.ap-shanghai.myqcloud.com/img/phymerge.png" alt="web_ui_wqx_2" style="zoom:100%;" />


```python
import urllib.request
import shutil
import re
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt


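def img_process(im_list):
    """Stack the images vertically and label each one with `<IMAGE i>`."""
    imgs = []
    for p in im_list:
        try:
            imgs.append(Image.open(p))
        except OSError:  # missing or unreadable image file
            return -1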
    new_w = 0
    new_h = 0
    for im in imgs:
        w, h = im.size
        new_w = max(new_w, w)
        new_h += h + 20
    new_w += 20
    new_h += 20
    pad = max(new_w // 4, 100)
    font = ImageFont.truetype("src/fonts/SimHei font.ttf", pad // 5)
    new_img = Image.new('RGB', (new_w + pad, new_h), 'white')
    draw = ImageDraw.Draw(new_img)
    curr_h = 10
    for idx, im in enumerate(imgs):
        w, h = im.size
        new_img.paste(im, (pad, curr_h))
        draw.text((0, curr_h + h // 2), f'<IMAGE {idx}>', font=font, fill='black')
        if idx + 1 < len(imgs):
            draw.line([(0, curr_h + h + 10), (new_w + pad, curr_h + h + 10)], fill='black', width=2)
        curr_h += h + 20

    plt.imshow(new_img)
    plt.title("Processed Image")
    plt.show()
    return new_img


sample = questions[0]  # `questions` is assumed to be a list of question dicts loaded earlier
question = sample['q_main']
mid_prompt = question

pattern_img_tag = re.compile(r'<img alt=.*?"/>')
pattern_src = re.compile(r'src=".*?"')

img_tags = pattern_img_tag.findall(mid_prompt)
im_list = []
for i, tag in enumerate(img_tags):
    # Replace each <img> tag with its placeholder, one occurrence at a time.
    mid_prompt = mid_prompt.replace(tag, f'<IMAGE {i}> ', 1)
    src = pattern_src.findall(tag)[0].split('"')[1]
    cache_path = f"data/img_cache/sample_{i}.png"
    if src.startswith("data/img/"):  # local file
        shutil.copy(src, cache_path)
    else:  # URL
        urllib.request.urlretrieve(src, cache_path)
    im_list.append(cache_path)

# Build the composite only when the question actually contains images.
processed_img = img_process(im_list) if im_list else None
```
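The placeholder substitution can also be isolated from the file handling. The following is a minimal sketch of a single-pass variant using `re.sub` with a callback; the function name `replace_img_tags` is hypothetical, and unlike the script above it does not append a trailing space after each placeholder:

```python
import re

IMG_TAG = re.compile(r'<img\b[^>]*?/>')
SRC_ATTR = re.compile(r'src="([^"]*)"')


def replace_img_tags(text: str) -> tuple[str, list[str]]:
    """Replace every <img .../> tag with <IMAGE i> and collect the src values."""
    sources: list[str] = []

    def _substitute(match: re.Match) -> str:
        src = SRC_ATTR.search(match.group(0))
        sources.append(src.group(1) if src else "")
        # The placeholder index is the position of this image in `sources`.
        return f"<IMAGE {len(sources) - 1}>"

    return IMG_TAG.sub(_substitute, text), sources


option = 'A: <img alt="" height="59px" src="data/img/0_0.png" width="149px"/>'
print(replace_img_tags(option))
```

Numbering the placeholders inside the callback keeps the text and the list of image sources aligned even when the same tag occurs more than once.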

## Jupyter Notebook Recording Process

To make the answering process reproducible, this project records each model's answers in a Jupyter Notebook. In each notebook, the model's output is shown alongside the standard answer so that the model's answering ability can be inspected directly. The basic structure of a notebook and the role of each cell are described below:

### Notebook Structure
- **Cell 1**: Uses Markdown to record the paper information, the standard answer to each question, and the model's output, together with an analysis of that output, so that the model's reasoning and results are easy to review.
- **Cell 2**: Contains the script that loads the model and the required libraries and initializes the model for inference.
- **Cell 3**: Runs model inference and prints the problem-solving record, showing each model output next to the standard answer so that the model's performance can be evaluated.
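
Since an `.ipynb` file is plain JSON, the cell layout described above can be checked without opening Jupyter. A minimal sketch using only the standard library, assuming the notebooks follow the usual nbformat layout with a top-level `cells` list (the helper `summarize_cells` is hypothetical):

```python
import json


def summarize_cells(notebook_json: str) -> list[tuple[str, int]]:
    """Return (cell_type, number_of_source_lines) for each cell in a notebook."""
    nb = json.loads(notebook_json)
    summary = []
    for cell in nb.get("cells", []):
        source = cell.get("source", [])
        # `source` may be stored as one string or as a list of lines.
        if isinstance(source, str):
            source = source.splitlines(keepends=True)
        summary.append((cell["cell_type"], len(source)))
    return summary
```

For a file on disk, pass the result of `open(path, encoding="utf-8").read()`; for the notebooks described here one would expect a Markdown cell followed by two code cells.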

### Example Notebook Content
#### Cell 1: Paper Information and Analysis
This cell uses Markdown to record the paper information, the standard answers, and the model outputs, along with an analysis of the model outputs, so that they can be browsed and compared.

```markdown
# 试卷名：新课标卷Ⅰ 高考真题 【数学】学科

## 题目编号：1
## 题目标答
因为$  A = \left\{ x | - \sqrt [ 3 ] { 5 } < x < \sqrt [ 3 ] { 5 } \right\} $ ， 又$  \sqrt [ 3 ] { 5 } < \sqrt [ 3 ] { 8 } = 2$  ，故$ A\cap B=\{-1,0\}$ ． 故选$ \text{A}$ ．

## 模型输出

...

------
```
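
How the Markdown in Cell 1 is produced is not shown here. As an illustration only, the sketch below renders the same layout from a list of records with the `id`, `answer`, and `output` keys that Cell 3 collects; the function `render_record` is hypothetical:

```python
def render_record(paper: str, subject: str, records: list[dict]) -> str:
    """Render paper information, standard answers and model outputs as Markdown."""
    lines = [f"# 试卷名：{paper} 高考真题 【{subject}】学科", ""]
    for rec in records:
        lines += [
            f"## 题目编号：{rec['id']}",
            "## 题目标答",
            rec["answer"],
            "",
            "## 模型输出",
            "",
            rec["output"],
            "",
            "------",  # separator between questions
            "",
        ]
    return "\n".join(lines)


print(render_record("新课标卷Ⅰ", "数学", [{"id": "1", "answer": "A", "output": "..."}]))
```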

#### Cell 2: Model Loading Script
Loads the required model and libraries and performs initialization. To ensure fairness, all outputs except essays are generated with **greedy** decoding. `max_length` is set to `2048`; only a small number of generation results were truncated, and these truncations were caused by repetitive generation.

```python
import re
import json

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda")

model_path = "path-to-model"
gen_kwargs = {"max_length": 2048, "do_sample": False}

tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).eval().to(device)
```
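
In the Hugging Face `generate` API, `max_length` bounds the prompt and the generated tokens together, whereas `max_new_tokens` bounds only the generated tokens. With the `gen_kwargs` above, the room left for the answer therefore shrinks as the question gets longer. A minimal sketch of that arithmetic (the helper `new_token_budget` is hypothetical and only illustrates the relationship):

```python
def new_token_budget(prompt_len: int, max_length: int = 2048) -> int:
    """Tokens left for generation when `max_length` counts the prompt as well."""
    # A prompt that already reaches max_length leaves no room for new tokens.
    return max(max_length - prompt_len, 0)


print(new_token_budget(300))
```

If a fixed answer length independent of the prompt is wanted, passing `max_new_tokens` instead of `max_length` is the usual alternative.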

#### Cell 3: Model Inference and Problem-Solving Record
Runs model inference and prints the problem-solving record, showing each model output next to the standard answer.

```python
subject, paper_type = "数学", "新课标卷Ⅰ"
file_name = f"../data/{paper_type}/{subject}.jsonl"

questions = []

print(f"试卷名：{paper_type} 高考真题 【{subject}】学科")

with open(file_name, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        data = json.loads(line)
        has_img, question = False, data['prompt']

        if '<img' in question:
            has_img = True
            question = re.sub(r'<img[^>]*?/>', "", question)

        inputs = tokenizer.apply_chat_template(
            [{"role": "user", "content": question}],
            add_generation_prompt=True,
            tokenize=True,
            return_tensors="pt",
            return_dict=True
        )
        inputs = inputs.to(device)

        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)

        if i == 0:
            print("*" * 35)
        else:
            print("*" * 15)

        print("题目编号：" + str(i+1) + ("（含图片）" if has_img else ""))
        print("题目标答：" + data["answer"])
        print("模型输出：" + response)
            
        questions.append({
            "id": str(i+1),
            "question": question,
            "answer": data["answer"],
            "output": response,
            "has_img": has_img
        })
```
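
The `questions` list built above can be kept for later scoring. A minimal sketch that writes one JSON object per line, assuming a JSONL file is an acceptable format (the helper `dump_records` is hypothetical; the repository's own storage format is not shown here):

```python
import io
import json


def dump_records(records, fp) -> None:
    """Write each record as one JSON line, keeping Chinese text readable."""
    for rec in records:
        # ensure_ascii=False stores non-ASCII characters as-is instead of \u escapes.
        fp.write(json.dumps(rec, ensure_ascii=False) + "\n")


buffer = io.StringIO()
dump_records([{"id": "1", "answer": "A", "output": "...", "has_img": False}], buffer)
print(buffer.getvalue())
```

To write to disk, open the target file with `encoding="utf-8"` and pass the file object as `fp`.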
