# Supplementary Material

### Introduction to GUI-World
Our dataset covers six GUI scenarios and eight types of GUI-oriented questions in three formats. It contains 12,379 videos and more than 100k QA pairs that address both static and dynamic GUI content.

**We present our dataset in four aspects:**

### Video-Text Pairs

In this section, we showcase only the text files from the benchmark, because the **12,379** videos total approximately **92 GB**. The text files for the entire dataset are also fairly large, at nearly **50 MB**. We have divided the dataset into six JSONL files, one per GUI scenario, located in the `Benchmark/` directory. Each line in these files is a dictionary containing the following keys:

- `system`: The annotated system
- `multi`: Indicates whether the video involves multiple windows, i.e., operations across different websites or software
- `app`: The software being operated
- `region`: Specifies whether the GUI content is focused on a specific region or the entire screen/software/webpage
- `goal`: The overall purpose of the video
- `keyframes`: Key frames annotated by human annotators, with each item containing subgoal and mouse/keyboard operations
- `video_path`: Path to the corresponding video
- `Description1`: A global description of the video
- `Caption`: A brief summary
- `static QA`: Queries regarding the static GUI content in the video
- `MCQA`: Multiple-choice questions related to the GUI content in the video
- `Description2`: A more detailed description of the video
- `Sequential-QA`: Questions about the temporal content in the video
- `Prediction`: Questions predicting the next stage in the video
- `Conversation`: Multi-round conversations focusing on GUI content, most of which consist of two rounds
- `Reasoning`: Reasoning questions about GUI content, both static and temporal, presented as multiple-choice questions
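The record layout above can be read with the standard library alone. The following is a minimal sketch, assuming each non-empty line is a standalone JSON dictionary as described; the file name `software.jsonl` and the helper names `load_benchmark` and `count_by_key` are hypothetical and are not part of the released code.

```python
import json
from collections import Counter
from pathlib import Path


def load_benchmark(path):
    """Read one JSONL file: one JSON dictionary per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def count_by_key(records, key):
    """Count records by the value of a top-level key such as `app` or `system`."""
    return Counter(str(r.get(key)) for r in records)


if __name__ == "__main__":
    # Hypothetical file name; substitute one of the six files in Benchmark/.
    path = Path("Benchmark") / "software.jsonl"
    if path.exists():
        records = load_benchmark(path)
        print(len(records), "records")
        print(count_by_key(records, "app").most_common(5))
```

Reading line by line keeps memory use modest and makes it easy to filter records, for example on `multi` or `region`, before any videos are loaded.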

### Case Study

Here, we present one video and its extracted keyframes for each GUI scenario. The files are located in the `Case Study/` directory.

### Benchmarking Code

We demonstrate how to benchmark three commercial Image LLMs using the provided API:

```bash
python benchmark.py --input "<benchmark_file>" \
--output "auto" \
--model "<benchmark_model>" \
--setting "<GUI scenarios>" \
--keyframe "<keyframe selection method>"
```
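When the same invocation has to be repeated for several files or models, it can help to assemble the argument list programmatically. The sketch below is only an illustration: the flag names are taken from the command above, the helper `build_benchmark_command` is hypothetical, and the angle-bracket values are placeholders to be replaced with real ones.

```python
import shlex
import sys


def build_benchmark_command(input_file, model, setting, keyframe, output="auto"):
    """Return the argument list for one benchmark.py run (flags as shown above)."""
    return [
        sys.executable, "benchmark.py",
        "--input", input_file,
        "--output", output,
        "--model", model,
        "--setting", setting,
        "--keyframe", keyframe,
    ]


if __name__ == "__main__":
    # Placeholder values; replace them before running anything.
    cmd = build_benchmark_command(
        input_file="<benchmark_file>",
        model="<benchmark_model>",
        setting="<GUI scenarios>",
        keyframe="<keyframe selection method>",
    )
    # Print the shell-quoted command for inspection. To execute it instead,
    # pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```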

This will generate a results file, which can be evaluated using the `code/llm-judge.py` script:

```bash
python llm-judge.py --input "<filename>" \
--output "auto" \
--model "gpt-4" \
[Other parameters] \
--directly  # Only for responses generated without CoT
```
### Ethical Considerations

The content of the videos complies with double-blind requirements, does not disclose any personal information, and does not involve any commercial infringement.