# Ablation Study Metrics Comparison
| Evaluation Setting              | Pairwise Agreement | Model Mean Correlation | Average Pearson Coefficient | Average Spearman Coefficient | Overall Score |
|:--------------------------------|-------------------:|-----------------------:|----------------------------:|-----------------------------:|--------------:|
| Human Evaluators                |              68.44 |                    N/A |                         N/A |                          N/A |         68.44 |
| reference_first_ablation_exp_v1 |              69.56 |                  97.20 |                       56.75 |                        55.49 |         69.75 |
| Baseline                        |              71.33 |                  99.54 |                       60.24 |                        59.12 |         72.56 |
| No Dim Weights                  |              70.89 |                  99.54 |                       60.11 |                        57.22 |         71.94 |
| vanilla_prompt_ablation_exp_v1  |              58.89 |                  98.89 |                       40.30 |                        43.75 |         60.46 |
| No Criteria Weights             |              70.67 |                  99.62 |                       59.83 |                        56.27 |         71.60 |
| static_criteria_ablation_exp_v1 |              68.33 |                  98.73 |                       57.86 |                        57.70 |         70.65 |
| No Weights                      |              71.11 |                  99.69 |                       59.46 |                        58.17 |         72.11 |
| no_reference_ablation_exp_v1    |              66.56 |                  97.46 |                       57.51 |                        51.23 |         68.19 |
## Metric Explanations
*   **Pairwise Agreement**: Degree of agreement between an evaluator's preference and the average preference of human evaluators. For ablation settings the evaluator is the model; for the Human Evaluators row it is the individual human evaluators. Higher is better.
*   **Model Mean Correlation**: Pearson correlation between each model's average score and the corresponding average human score. Correlation values in the table appear to be multiplied by 100. Higher is better.
*   **Average Pearson Coefficient**: Pearson correlation between model scores and average human scores, calculated separately for each prompt and then averaged over prompts. Computed on ICC-filtered prompts only. Higher is better.
*   **Average Spearman Coefficient**: Spearman rank correlation between model scores and average human scores, calculated separately for each prompt and then averaged over prompts. Computed on ICC-filtered prompts only. Higher is better.
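*   **Overall Score**: Arithmetic mean of the four metrics above. Metrics that are unavailable for a row are left out of the mean, so the Overall Score for Human Evaluators equals its Pairwise Agreement (68.44). Higher is better.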
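
The Overall Score column can be checked against the table with a few lines of Python. This is a minimal sketch, not the script that produced the report: the function name `overall_score` is hypothetical, and skipping unavailable metrics is an assumption inferred from the Human Evaluators row.

```python
import math

def overall_score(metrics):
    """Arithmetic mean of the metrics that are available (not NaN).

    Assumption: unavailable metrics are skipped rather than treated as 0.
    """
    available = [m for m in metrics if not math.isnan(m)]
    return sum(available) / len(available)

# Values copied from the table, in column order: Pairwise Agreement,
# Model Mean Correlation, Average Pearson, Average Spearman.
reference_first = [69.56, 97.20, 56.75, 55.49]
human = [68.44, math.nan, math.nan, math.nan]

print(f"{overall_score(reference_first):.2f}")  # 69.75, as in the table
print(f"{overall_score(human):.2f}")            # 68.44, as in the table
```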

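
The two per-prompt coefficients follow the same pattern: compute one correlation per prompt, then average across prompts. The sketch below illustrates that pattern in plain Python. The data layout (a dict mapping each prompt to a pair of score lists), the helper names, and the toy numbers are all hypothetical, and ICC filtering is assumed to have been applied beforehand.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def average_coefficient(per_prompt, corr):
    """Mean of corr(model_scores, human_avg_scores) over all prompts."""
    values = [corr(model, human) for model, human in per_prompt.values()]
    return sum(values) / len(values)

# Toy data (illustrative only): prompt -> (model scores, average human scores)
toy = {
    "prompt_a": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "prompt_b": ([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]),
}

avg_pearson = average_coefficient(toy, pearson)
avg_spearman = average_coefficient(toy, spearman)
print(avg_pearson, avg_spearman)  # both roughly 0.75 on this toy data
```

Averaging per-prompt coefficients weights every prompt equally, regardless of how many responses it has; pooling all scores into a single correlation would instead let between-prompt differences dominate.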