# WebDevJudge

This repository contains the code for the paper "WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality".

To run the code, first download the data from https://huggingface.co/datasets/lmarena-ai/webdev-arena-preference-10k, convert it to JSON Lines format, save it as `full.jsonl`, and place it in the `data` folder. Then, extract the records whose `question_id` appears in `new_labels.json`.
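The extraction step can be sketched roughly as follows. This is a minimal illustration, not the repository's own script: the helper names (`read_jsonl`, `labelled_ids`, `extract`) are hypothetical, and the layout of `new_labels.json` is an assumption (either an object keyed by `question_id`, or a list of objects that each carry a `question_id` field). Adjust `labelled_ids` if the file is structured differently.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(records, path):
    """Write records to a JSON Lines file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def labelled_ids(labels):
    """Collect question ids from the parsed labels.

    Assumption: `labels` is either a dict keyed by question_id or a list
    of objects that each contain a "question_id" field.
    """
    if isinstance(labels, dict):
        return {str(key) for key in labels}
    return {str(item["question_id"]) for item in labels}


def extract(full_path, labels_path, out_path):
    """Keep records whose question_id is labelled; return how many were kept."""
    labels = json.loads(Path(labels_path).read_text(encoding="utf-8"))
    wanted = labelled_ids(labels)
    # Ids are compared as strings so that 5 and "5" count as the same id.
    kept = [r for r in read_jsonl(full_path) if str(r.get("question_id")) in wanted]
    write_jsonl(kept, out_path)
    return len(kept)
```

For example, `extract("data/full.jsonl", "new_labels.json", "data/labelled.jsonl")` would write the matching records to a new file; the output name `data/labelled.jsonl` and the location of `new_labels.json` in the repository root are placeholders, so use whatever paths the evaluation scripts expect.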

To run the Likert scale evaluation, please add your Azure OpenAI API key to `api_keys/openai.json`.

Then you can run the following command to evaluate the models:

```bash
bash scripts/gpt.sh
```

The results will be saved in the `outputs_full/likert` folder.

For the GUI agent evaluation, first build the environment by following the instructions in `envs/README.md`.
Then, run `utils/preprocess.py` to preprocess the data.
Ensure you have the correct API key for the GUI agent in `api_keys/ui_tars.json`.
Afterward, you can run the following command to evaluate the models:

```bash
bash run_serial.sh 99 1 10 0
```

The results will be saved in the folders created by the preprocessing script.
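
Before launching either evaluation, it can help to confirm that the files mentioned in this README are in place. The check below is a small sketch, not part of the repository: `preflight` is a hypothetical helper, it assumes it is run from the repository root, and it only verifies that the key files exist and parse as JSON. It does not validate their contents, since the expected fields are not described here.

```python
import json
from pathlib import Path

# Paths named in this README, relative to the repository root.
REQUIRED_FILES = ["data/full.jsonl"]
REQUIRED_JSON = ["api_keys/openai.json", "api_keys/ui_tars.json"]


def preflight(root="."):
    """Return a list of problems found; an empty list means every check passed."""
    root = Path(root)
    problems = []
    for rel in REQUIRED_FILES + REQUIRED_JSON:
        if not (root / rel).is_file():
            problems.append(f"missing: {rel}")
    for rel in REQUIRED_JSON:
        path = root / rel
        if path.is_file():
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems.append(f"invalid JSON in {rel}: {exc}")
    return problems
```

Calling `preflight()` and printing the returned list gives a quick summary of anything that still needs to be set up. If you only run the Likert scale evaluation, a report about `api_keys/ui_tars.json` can be ignored.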