# KoLA
The Knowledge-oriented LLM Assessment benchmark (KoLA) aims to carefully benchmark the world knowledge of LLMs through a meticulous design of its data, ability taxonomy, and evaluation metrics.

![kola](./kola.png)

## Data Acquisition

For researchers and developers interested in participating in the evaluation, we provide files under `Sample_data/`. This folder contains sample data and data descriptions for all tasks.

It includes 5 examples for each established task in KoLA, along with a detailed Readme file describing the data.

## Evaluation Scripts

We provide a sample evaluation script for each task. Each script evaluates the model's performance on its task and is run as follows:

```bash
python eval/<dataset_id>_evaluate.py <input_file> <output_file>
```
- `<dataset_id>`: the id of the dataset, e.g., `3-4_kqapro`.
- `<input_file>`: the inference result of the model, e.g., `kqapro_inference.json`.
- `<output_file>`: the output score file of the evaluation, e.g., `kqapro_evaluation.json`.
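
When several tasks need to be scored, the command above can be wrapped in a small helper. The following is a minimal sketch, not part of the repository: the function names `build_eval_command` and `run_evaluation` are hypothetical, and it assumes the scripts live in `eval/` and take exactly the two positional arguments shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_eval_command(dataset_id, input_file, output_file, eval_dir="eval"):
    """Build the argument list for one evaluation run (hypothetical helper)."""
    # Script naming follows the documented pattern: eval/<dataset_id>_evaluate.py
    script = Path(eval_dir) / f"{dataset_id}_evaluate.py"
    return [sys.executable, str(script), str(input_file), str(output_file)]


def run_evaluation(dataset_id, input_file, output_file, eval_dir="eval"):
    """Run one evaluation script; raises CalledProcessError if it fails."""
    cmd = build_eval_command(dataset_id, input_file, output_file, eval_dir)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # File names taken from the examples above.
    print(build_eval_command("3-4_kqapro", "kqapro_inference.json", "kqapro_evaluation.json"))
```

Using `sys.executable` keeps the evaluation in the same Python environment as the caller, which is usually what you want when dependencies are installed in a virtual environment.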

## Result Analysis Tools

To make the paper's results easier to reproduce, we provide tools for obtaining each model's absolute performance, along with operations such as standardization and visualization. These tools also let subsequent contributors obtain results in advance during the leaderboard waiting period, and they help ensure the fairness of our results.

First, we have put the raw evaluation results of the 21 models in `analyse/results/`.
To get standardized scores, run the following command:

```bash
python analyse/unify.py
```

Then, the standardized results will be saved in `analyse/dataset_final_scores.csv`.
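
To illustrate what standardization means here, the sketch below computes z-scores across models for a single task. It is an assumption for illustration only: the function name `standardize` and the sample scores are hypothetical, and the exact formula used by `analyse/unify.py` may differ (for example, it may apply additional rescaling).

```python
from statistics import mean, pstdev


def standardize(scores):
    """Return z-scores for one task, given a {model_name: raw_score} dict.

    Assumption: plain z-score with population standard deviation;
    this is not necessarily what analyse/unify.py does.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All models scored the same; no model stands out.
        return {model: 0.0 for model in scores}
    return {model: (score - mu) / sigma for model, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical raw scores for three models on one task.
    raw = {"model_a": 1.0, "model_b": 2.0, "model_c": 3.0}
    for model, z in standardize(raw).items():
        print(f"{model}: {z:.3f}")
```

Standardizing per task puts scores from tasks with different metrics and ranges on a comparable scale before they are aggregated.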

Second, we provide a visualization tool. Run `analyse/plot.ipynb` to get the *Spearman correlation* and *scatter plots*.
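
For readers unfamiliar with the Spearman correlation, the following dependency-free sketch shows how it can be computed: rank both score lists (averaging ranks for ties), then take the Pearson correlation of the ranks. The function names are hypothetical and this is not the notebook's code, which may rely on a library routine instead.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))
```

Because it depends only on ranks, the Spearman correlation measures whether two tasks order the models similarly, regardless of the scale of the underlying scores.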

