# 🧠 Evaluation
## Response Generation
Our code supports the evaluation of both open-source models (such as Qwen2.5-VL, InternVL, LLaVA, etc.) and closed-source models (such as GPT, Claude, etc.).
### Single-image
For single-image reasoning, you can use the code under the Single_image_response directory to generate responses from these models. By default, the responses are saved as Markdown (.md) files in the corresponding data folders.
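
To check what was generated, the Markdown responses can be collected with a few lines of Python. This is only a minimal sketch: the helper name `list_responses` and the example folder are hypothetical, and it assumes nothing more than that responses are `.md` files somewhere under a data folder.

```python
from pathlib import Path


def list_responses(data_dir):
    """Return all .md files found under data_dir, sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        # Nothing to list if the folder does not exist.
        return []
    return sorted(root.rglob("*.md"))


if __name__ == "__main__":
    # Hypothetical folder; point this at your own data directory.
    for path in list_responses("Data"):
        print(path)
```
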
### Multi-image
For multi-image reasoning, you can use the following commands to generate responses from these models:

```bash
#!/bin/bash
# Generate multi-image responses with InternVL2.5-78B using the CoT strategy.
python ../response.py \
  --data_dir '../Data/PhyTest' \
  --strategy 'CoT' \
  --config_path '../Configs/default.yaml' \
  --model_path '../pretrained/InternVL2_5-78B' \
  --output_dir 'Result/PhyTest' \
  --max_tokens 1024 \
  --temperature 0.7 \
  --subject 'PhyTest'
```
In addition, the Scripts/instance directory provides example multi-image response-generation scripts for different models.
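
To run the same command for several models, the invocation can be wrapped in a small Python helper. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, the second model path is a placeholder, and the data and output folders are assumed to follow the `PhyTest` pattern from the command above. Only the flags shown in that command are used.

```python
import subprocess
import sys


def build_command(model_path, subject="PhyTest", strategy="CoT"):
    """Build the argument list for response.py from the flags shown above."""
    return [
        sys.executable, "../response.py",
        "--data_dir", f"../Data/{subject}",      # assumed folder layout
        "--strategy", strategy,
        "--config_path", "../Configs/default.yaml",
        "--model_path", model_path,
        "--output_dir", f"Result/{subject}",     # assumed folder layout
        "--max_tokens", "1024",
        "--temperature", "0.7",
        "--subject", subject,
    ]


if __name__ == "__main__":
    # The first path comes from the example above; the second is a placeholder.
    models = ["../pretrained/InternVL2_5-78B", "../pretrained/YOUR_MODEL"]
    for model in models:
        # check=True stops the loop if one run fails.
        subprocess.run(build_command(model), check=True)
```
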
## CoT Evaluation
### CoT Quality
For all CoT Quality metrics, you can generate the required JSON files with the code under Evaluation/quality_generate.
### CoT Efficiency
For all CoT Efficiency metrics, you can generate the required JSON files with the code under Evaluation/efficiency_generate.
### CoT Diversity
For all CoT Diversity metrics, you can generate the required JSON files with the code under Evaluation/solution_generate.
## Score Calculation
Once all the CoT Evaluation JSON files have been generated, you can calculate the scores. All metrics are computed by the code under Evaluation/metric_calculate.
Among them, CoT_F1_score.py calculates the CoT Quality score, cot_efficienc.py calculates the CoT Efficiency score, and Multi_Solution.py calculates the CoT Diversity score.
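
For orientation, an F1 score is the harmonic mean of precision and recall. The sketch below shows only that general formula. It is an assumption, based on the file name alone, that CoT_F1_score.py combines a precision value and a recall value in this way; the function name and inputs are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        # Avoid division by zero when there is no signal at all.
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative values only, not results from any model.
    print(f1_score(0.8, 0.6))
```
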
