
<p align="center" width="100%">
<a href="" target="_blank">

# FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets

[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

This is the official github repository for [FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets]

<p align="center">
  <img src="assets/overall_fig.png" width="80%" height="80%">
</p>

## Overview
This is the repository for the FLASK project, a task-agnostic evaluation protocol of language model with fine-grained instance-wise skill set as metrics.

## Evaluation
This code repository contains the implementation of model-based implementation of FLASK. For guidelines for human-based evaluation, refer to the paper.

### Step 1. OpenAI API Information
Since our model-based evaluation is based on GPT-4 evaluation, you need to add your OpenAI API key into the file of `openai_info/api_info.json` where you need to add your key in `api_keys` list. Note that we also support multiple keys for faster evaluation.

### Step 2. Model Inference
If you want to inference a model on the evaluation set of FLASK, run the following command.

```
cd model_output
python inference.py --model-path {model_path} --model-id {model_id} --question-file ../input_data/flask_evaluation_raw.jsonl --answer-file {output_directory} --num-gpus {num_gpus}
```

We provide the inference results of various LLMs on `model_output/outputs` directory. Note that for the inference of FLASK-Hard, you can simply replace the `--question-file` argument to `../evaluation_set/flask_hard_evaluation.jsonl`.

### Step3. Model Evaluation
After inference, we can evaluate the model using FLASK evaluation protocol. Run the following command for evaluation.

```
cd gpt_review
python gpt4_eval.py -q '../evaluation_set/flask_evaluation.jsonl' -a {answer_file} -o {output_review_file} -e {output_error_file}
```

Note that the error file exists for instances that fail due to rate limit of OpenAI API. If the error file is created after inference, you can rerun only the error instances by running the following command:

```
cd gpt_review
python gpt4_eval.py -q {output_error_file} -a {answer_file} -o {output_review_file} -e {new_output_error_file}
```

We provide the GPT-4 evaluation result of various models in `gpt_review/outputs` directory. 

### Step4. Aggregation and Analysis
After evaluation, FLASK enables fine-grained analysis depending on the skills, domains, and the difficulty levels. 

For analyzing the performance for each `skill`, run the following command:

```
cd gpt_review
python aggregate_skill.py -m {output_review_file}
```

For analyzing the performance for each skill depending on the `difficulty`, run the following command:

```
cd gpt_review
python aggregate_difficulty_skill.py -m {output_review_file}
```

For analyzing the performance for each `domain`, run the following command:

```
cd gpt_review
python aggregate_domain.py -m {output_review_file}
```

## Metadata Annotation
We also provide the implementation of the process of automatic metadata annotation of FLASK. 

### Step 1. OpenAI API Information
Since our model-based evaluation is based on GPT-4 evaluation, you need to add your OpenAI API key into the file of `openai_info/api_info.json` where you need to add your key in `api_keys` list. Note that we also support multiple keys for faster evaluation.

### Step 2. Domain Annotation
For domain metadata annotation, run the following command:

```
cd metadata_annotation/domain
python domain_annotation.py -o {output_domain_annotation} -e {output_domain_annotation_error} 
```

We define 10 different domains revising the domain categorization of Wikipedia. Note that the error file exists for instances that fail due to rate limit of OpenAI API. If the error file is created after inference, you can rerun only the error instances.


### Step 3. Skillset Annotation
For skillset annotation, run the following command:

```
cd metadata_annotation/skillset
python skillset_annotation.py -q {output_domain_annotation} -o {output_skill_annotation} -e {output_skill_annotation_error}
```

We define 12 skills for fine-grained evaluation of LLMs. 

### Step 4. Difficulty Annotation
For difficulty level annotation, run the following command:

```
cd metadata_annotation/difficulty
python difficulty_annotation.py -q {output_skill_annotation} -o {output_difficulty_annotation} -e {output_difficulty_annotation_error}
```

We define 5 different difficulty levels depending on the domain knowledge. Note that after Step 4, each row of the `output_difficulty_annotation` file should have the same format as the row of `evaluation_set/flask_evaluation.jsonl`.
