# FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs

This README describes the supplementary materials of FREAK.

We sincerely apologize that due to time constraints, we were unable to fully organize all the code, data, intermediate results, and other materials for the project. In these supplementary materials, we have only included the most critical code. Furthermore, limited by the size restrictions of the supplementary materials, we have provided only a subset of the FREAK images.

## Overview

The supplementary materials are organized as follows.

```
/data_generation: core code for dataset generation.
/inference: code for model inference.
/eval: code for evaluating the inference outputs.
/analysis: experiment code for the paper's Analysis section.
/human_baseline: materials documenting how we built the human baseline.
/generated_dataset: a sampled subset of the FREAK dataset.
```

Note that some of the models we test require a dedicated conda environment. We will provide further scripts and code soon.

## Download Dataset

**To ICLR reviewers: we have prepared a subset containing 26 images for review. The dataset is in `./dataset`.**

The dataset contains 1,786 CCS images and 1,799 questions: 1,000 multiple-choice questions and 799 free-form questions, saved in `./dataset/dataset.json` and `./dataset/dataset_qa.json`, respectively. The images for the questions are saved in `./dataset/generated_dataset/final_part2`.

| Category  | Num. | Proportion (%) | Description                                                                                               |
| --------- | ---- |:-------------:| --------------------------------------------------------------------------------------------------------- |
| Detection | 612  | 34            | Requires models to identify salient structures of target objects.                                         |
| Attribute | 479  | 27            | Demands a description of geometric attributes (e.g. shape, size, color) for specified structures.         |
| Counting  | 414  | 23            | Evaluates models' ability to enumerate target architectures.                                              |
| Analysis  | 320  | 18            | Evaluates the models' inference capabilities based on visual content.                                     |
| Position  | 193  | 11            | Requires the model to determine the spatial locations or relationships of specific objects or structures. |
| OCR       | 139  | 8             | Challenges models to extract target text or locate specified characters from images.                      |

Notably, certain questions are assigned to more than one category.
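
As a quick sanity check on the table, the proportions appear to be computed against the 1,799 questions rather than against the sum of the category counts, which would explain why they add up to more than 100%. A minimal sketch of that arithmetic, using only the numbers from the table above:

```python
# Category counts copied from the table above.
category_counts = {
    "Detection": 612,
    "Attribute": 479,
    "Counting": 414,
    "Analysis": 320,
    "Position": 193,
    "OCR": 139,
}
total_questions = 1000 + 799  # multiple-choice + free-form

# Share of all questions carrying each category label, rounded to a whole percent.
proportions = {
    name: round(100 * count / total_questions)
    for name, count in category_counts.items()
}

# Labels outnumber questions because some questions belong to several categories.
total_labels = sum(category_counts.values())
extra_labels = total_labels - total_questions

print(proportions)
print(total_labels, extra_labels)
```

The rounded values match the Proportion column, and the category labels total 2,157, which is 358 more than the number of questions.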



## Dataset Generation

We provide the object list used for data generation during the construction of FREAK. The object list is taken from ImageNet-1K.

For data generation, first run `./data_generation/download_dataset_content.py` to obtain the CCS content. Once you have the content JSON file, use `./data_generation/generate_images.py` to generate normal images and then edit them into CCS images.

We also provide a GUI application for data generation. It lets you inspect each data item (the CCS content and its corresponding images), which makes filtering and verifying items convenient and helps avoid low-quality image edits. Here is a screenshot of the GUI application.

![app](misc/app.png) 

Before using the data generation code, adjust the file paths in the code and replace *env.json* with your own API keys. We save the generated dataset in `./generated_dataset9`. For example, you can generate the data yourself with the following commands:

```bash
# Step 1: download the CCS content
python ./data_generation/download_dataset_content.py \
  --model_name [MODEL-NAME] \
  --ccs_content_save_dir [SAVE-DIR] \
  --labels [OBJECT-LIST-FILE]

# Step 2: generate normal images and, optionally, edit them into CCS images
python ./data_generation/generate_images.py \
  --ccs_content_file [CCS-CONTENT-FILE-PATH] \
  --edit ["True"/"False"]
```

You can then inspect the data through the JSON file and the images. Alternatively, you can generate only the normal images first and then use the GUI application to select and edit images:

```bash
python ./data_generation/annotation_app.py \
  --ccs_content_file [CCS-CONTENT-FILE-PATH] \
  --images_dir [IMAGES-DIR]
```

Results from the GUI app are saved to `./generated_dataset9/dataset2.json`.

## Model Evaluation

Before running inference, check the integrity of the data files. The `./generated_dataset` directory should contain (1) `dataset.json`, (2) `dataset_qa.json`, and (3) the `final_part2` folder with 1,786 images.
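
The check described above could be automated along the following lines. This is only a sketch: the function name `check_dataset` and the set of accepted image extensions are assumptions, not part of the released code, and the JSON files are only checked for parseability because their schema is not described here.

```python
import json
from pathlib import Path

# Assumption: images use one of these extensions; adjust to match the release.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def check_dataset(root, expected_images=1786):
    """Return a list of problems found; an empty list means all checks passed."""
    root = Path(root)
    problems = []

    # Both question files must exist and contain valid JSON.
    for name in ("dataset.json", "dataset_qa.json"):
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
            continue
        try:
            with path.open(encoding="utf-8") as f:
                json.load(f)
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON in {path}: {exc}")

    # The image folder must exist and hold the expected number of images.
    image_dir = root / "final_part2"
    if not image_dir.is_dir():
        problems.append(f"missing folder: {image_dir}")
    else:
        n_images = sum(
            1
            for p in image_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
        if n_images != expected_images:
            problems.append(f"expected {expected_images} images, found {n_images}")

    return problems


if __name__ == "__main__":
    for problem in check_dataset("./generated_dataset"):
        print(problem)
```

If the script prints nothing, all three items were found. When working with the 26-image review subset, pass `expected_images=26` instead of the default.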

### Closed-source models evaluation

1. Replace `env.json` with your own API keys.

2. Run:
   
   ```bash
   # Multiple-choice questions
   python inference/mcq.py --model_name [EVAL-MODEL-NAME] --model_famliy ['None'/'gemini'/'claude'/'glm'] --parallel_num [PARALLEL-NUM] --save_dir [OUTPUT-SAVE-DIR] --prompt_type ['wo' for normal prompt, 'cot' for CoT prompt]
   
   # Free-form questions
   python inference/qa.py --model_name [EVAL-MODEL-NAME] --model_famliy ['None'/'gemini'/'claude'/'glm'] --parallel_num [PARALLEL-NUM] --thinking ['True'/'False' ENABLE_THINKING_MODE_OR_NOT]
   ```

3. Check `eval/model_result.py` and add the output file to *all_data*; an example is provided in `eval/model_result.py`.

### Open-source models evaluation

We use vLLM to evaluate open-source models. Kimi-VL-A3B models require vLLM 0.9.1; for all other models we use version 0.10.1.

1. Start a deployment server for the model:
   
   ```bash
   vllm serve [MODEL-CKPT-PATH] \
     --port [PORT] \
     --gpu-memory-utilization 0.8 \
     --max-model-len 8000 \
     --tensor-parallel-size [PARALLEL_SIZE] \
     --served-model-name [DEPLOYMENT_NAME] \
     --trust-remote-code \
     --limit_mm_per_prompt "image=1" \
     --disable-log-requests
   ```



2. Run:
   
   ```bash
   # Multiple-choice questions
   python inference/mcq_local.py --model_name [DEPLOYMENT_NAME] --api_url "http://[SERVED_IP]:[PORT]/v1" --parallel_num [PARALLEL-NUM] --save_dir [OUTPUT-SAVE-DIR] --judge_model_name [JUDGE-MODEL-NAME]
   
   # Free-form questions
   python inference/qa_local.py --model_name [DEPLOYMENT_NAME] --api_url "http://[SERVED_IP]:[PORT]/v1" --parallel_num [PARALLEL-NUM] --save_dir [OUTPUT-SAVE-DIR] --judge ["True"/"False" LLM-as-judge during inference] --judge_model_name [JUDGE-MODEL-NAME] --thinking "False"
   ```
   
   

   Note that some arguments are omitted from the commands above for brevity.

3. Check `eval/model_result.py` and add the output file to *all_data*; an example is provided in `eval/model_result.py`.
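
Before running the inference scripts in step 2, it may help to confirm that the vLLM server is reachable and is serving the expected deployment name. The sketch below queries the model list exposed by vLLM's OpenAI-compatible server. The helper names (`models_url`, `served_model_names`, `is_model_served`) are hypothetical and not part of the released code.

```python
import json
import urllib.request


def models_url(api_url):
    """Build the model-listing URL from the same base URL passed as --api_url."""
    return api_url.rstrip("/") + "/models"


def served_model_names(payload):
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def is_model_served(api_url, deployment_name, timeout=5.0):
    """Return True if the server at api_url lists deployment_name.

    Raises OSError (e.g. urllib.error.URLError) if the server cannot be reached.
    """
    with urllib.request.urlopen(models_url(api_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return deployment_name in served_model_names(payload)
```

For example, `is_model_served("http://[SERVED_IP]:[PORT]/v1", "[DEPLOYMENT_NAME]")`, with the placeholders substituted, should return `True` once the server from step 1 has finished loading the model.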

### Analysis Experiment

We have released the core code used in the paper's Analysis section, which is located in `./analysis`.

## Human Baseline

We have publicly released the recorded answers of the 100 undergraduate students we employed to answer questions in FREAK, in .csv format. Due to constraints, the questionnaire was conducted in Chinese, and the respondents also answered the free-form questions in Chinese. The `./human_baseline` folder includes the Chinese translation of the questions shown to the participants, together with their responses. Finally, we compiled and translated the participants' responses and provide a JSON file with the human baseline statistics.
