# Reasoning-Executing-Gaps

Analysis and evaluation of the “reasoning–execution” gap for multimodal GUI agents. This repo provides inference scripts for multiple models (UI-TARS, GUI-Owl, AgentCPM-GUI), exact-match (EM) evaluation, chain-of-thought (CoT) reasoning and GTA annotation/evaluation, and quadrant analysis utilities.

- Model downloads: see [model/README.md](model/README.md)
- Evaluation data preparation: see [eval/eval_data/README.md](eval/eval_data/README.md)
- Model inference and EM evaluation: see [eval/README_TARS.md](eval/README_TARS.md), [eval/README_OWL.md](eval/README_OWL.md)
- CoT inference and GTA evaluation: see [eval/README_COT.md](eval/README_COT.md)
- Manual annotation workflow and automated summaries: see [cot_eval/rq1.md](cot_eval/rq1.md)

## Repository structure

- Data and scripts
  - Evaluation data prep: [eval/eval_data/README.md](eval/eval_data/README.md)
  - Inference/evaluation guides:
    - TARS series: [eval/README_TARS.md](eval/README_TARS.md)
    - GUI-Owl series: [eval/README_OWL.md](eval/README_OWL.md)
    - CoT reasoning + GTA evaluation: [eval/README_COT.md](eval/README_COT.md)
  - Key scripts and configs:
    - Inference (examples): [eval/run_predict_ui_tars.py](eval/run_predict_ui_tars.py), [eval/run_predict_ui_tars1_5.py](eval/run_predict_ui_tars1_5.py)
    - EM evaluation: [eval/run_eval_agent.py](eval/run_eval_agent.py)
    - Action schemas (execution vs extraction): [eval/utils/schema/schema.json](eval/utils/schema/schema.json), [eval/utils/schema/schema_for_extraction.json](eval/utils/schema/schema_for_extraction.json)
- CoT annotation and analysis
  - Lightweight GUI: [cot_eval/annot_ui_min.py](cot_eval/annot_ui_min.py)
  - Merging and experiment notes: [cot_eval/rq1.md](cot_eval/rq1.md)
- Dependencies
  - Top-level: [requirements.txt](requirements.txt)

## Environment

Linux + Python 3.10/3.11 + CUDA is recommended.

```bash
# Create and activate a conda env (Python 3.10 as example)
conda create -n reg python=3.10 -y
conda activate reg

# Install dependencies (includes PyTorch 2.5.1, etc.)
pip install -r requirements.txt
```

If you process raw evaluation data (optional; see “Evaluation data” below), use the environment suggested in [eval/eval_data/README.md](eval/eval_data/README.md); some of its scripts assume Python 3.11.

## Models

Download required models into model/ as described in [model/README.md](model/README.md) (e.g., UI-TARS-1.5-7B, GUI-Owl-7B, GUI-Owl-32B). Ensure local paths match what the inference scripts expect.

## Evaluation data

Follow [eval/eval_data/README.md](eval/eval_data/README.md) to download and preprocess each dataset. Example:

```bash
# Enter the data prep directory
cd eval/eval_data

# Android Control: preprocess once, then expose the same directory
# under the high and low split names via symlinks
python process_ac.py
ln -s android_control_test android_control_high_test
ln -s android_control_test android_control_low_test

# CAGUI (chinese_app_test): download only CAGUI_agent/, then rename it to test/
mkdir chinese_app_test && cd chinese_app_test
huggingface-cli download openbmb/CAGUI --repo-type dataset --include "CAGUI_agent/**" --local-dir ./ --local-dir-use-symlinks False --resume-download
mv CAGUI_agent test

# AITZ: assumes the raw data is already at tmp/android_in_the_zoo
cd ../
mv tmp/android_in_the_zoo ./aitz_test
python process_aitz.py
```

Return to the repo root for subsequent steps.
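
A quick check that the expected directories exist can catch a skipped step before inference. The sketch below assumes the directory names produced by the commands above, relative to `eval/eval_data`; `missing_dataset_dirs` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# Directory names taken from the preparation commands above;
# adjust them if your layout differs.
EXPECTED_DIRS = [
    "android_control_test",
    "android_control_high_test",
    "android_control_low_test",
    "chinese_app_test/test",
    "aitz_test",
]


def missing_dataset_dirs(data_root: str) -> list[str]:
    """Return the expected dataset directories that are absent under data_root."""
    root = Path(data_root)
    # Path.is_dir() follows symlinks, so the `ln -s` aliases count as present.
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs("eval/eval_data")
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All expected dataset directories are present.")
```

Run it from the repo root; an empty result means the layout matches the commands above, not that the contents were preprocessed correctly.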

## Inference and EM evaluation

See per-model guides:
- UI-TARS: [eval/README_TARS.md](eval/README_TARS.md)
- GUI-Owl: [eval/README_OWL.md](eval/README_OWL.md)

Sample EM evaluation for UI-TARS-1.5-7B (from [eval/README_TARS.md](eval/README_TARS.md)):

```bash
# aitz_test
python eval/run_eval_agent.py \
  --input_path ./eval_results/UI-TARS-1.5-7B/aitz_test/all.jsonl \
  --output_dir ./eval_results/UI-TARS-1.5-7B/aitz_test/results \
  --data_name aitz_test

# chinese_app_test
python eval/run_eval_agent.py \
  --input_path ./eval_results/UI-TARS-1.5-7B/chinese_app_test/all.jsonl \
  --output_dir ./eval_results/UI-TARS-1.5-7B/chinese_app_test/results \
  --data_name chinese_app_test

# android_control_high_test
python eval/run_eval_agent.py \
  --input_path ./eval_results/UI-TARS-1.5-7B/android_control_high_test/all.jsonl \
  --output_dir ./eval_results/UI-TARS-1.5-7B/android_control_high_test/results \
  --data_name android_control_high_test
```
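
The commands above differ only in the dataset name, so they can be generated in a loop. The following is a minimal sketch that reuses the flags and path layout shown above; `build_eval_command` is a hypothetical helper, and the actual run is left commented out.

```python
import subprocess
import sys

DATASETS = ["aitz_test", "chinese_app_test", "android_control_high_test"]


def build_eval_command(model_name: str, data_name: str) -> list[str]:
    """Build the run_eval_agent.py argument list for one model/dataset pair."""
    result_dir = f"./eval_results/{model_name}/{data_name}"
    return [
        sys.executable, "eval/run_eval_agent.py",
        "--input_path", f"{result_dir}/all.jsonl",
        "--output_dir", f"{result_dir}/results",
        "--data_name", data_name,
    ]


if __name__ == "__main__":
    for data_name in DATASETS:
        cmd = build_eval_command("UI-TARS-1.5-7B", data_name)
        print(" ".join(cmd))
        # Uncomment to run the evaluation from the repo root:
        # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if a model or dataset name ever contains spaces.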

Notes:
- Before evaluation, generate the per-dataset inference output `all.jsonl` as instructed in the corresponding README.
- Quick toggles for UI-TARS-1.5: `USE_LOW_INSTRUCTION` and `LANGUAGE` in [eval/run_predict_ui_tars1_5.py](eval/run_predict_ui_tars1_5.py). The history/action string construction lives in the same script.
- GPU control: set `CUDA_VISIBLE_DEVICES=...` or edit the device lists in the scripts (see [eval/run_predict_ui_tars.py](eval/run_predict_ui_tars.py)).
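
Before running the evaluation, it can help to confirm that an `all.jsonl` file actually parses. The sketch below assumes only that the file is standard JSON Lines (one JSON value per line); it makes no assumption about field names, and `check_jsonl` is a hypothetical helper.

```python
import json


def check_jsonl(path: str) -> tuple[int, list[int]]:
    """Return (number of parsed records, 1-based line numbers that failed to parse)."""
    ok, bad_lines = 0, []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                ok += 1
            except json.JSONDecodeError:
                bad_lines.append(lineno)
    return ok, bad_lines
```

For example, `check_jsonl("./eval_results/UI-TARS-1.5-7B/aitz_test/all.jsonl")` returns the record count and any unparseable line numbers, which usually point to an interrupted inference run.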

## CoT reasoning and GTA evaluation

- Follow [eval/README_COT.md](eval/README_COT.md) for CoT inference artifacts and GTA evaluation.
- JSON Schemas for execution/extraction:
  - Execution: [eval/utils/schema/schema.json](eval/utils/schema/schema.json)
  - Extraction: [eval/utils/schema/schema_for_extraction.json](eval/utils/schema/schema_for_extraction.json)

## Manual annotation and result merging (optional)

- Lightweight GUI (Gradio): [cot_eval/annot_ui_min.py](cot_eval/annot_ui_min.py)
- Multi-annotator merge, conflicts, and statistics: [cot_eval/rq1.md](cot_eval/rq1.md)
  - Examples include merging two independent reviews for GUI-Owl-7B on AITZ and exporting the conflicts.
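
The idea behind merging two independent reviews can be sketched as follows: labels that agree are accepted, and labels that disagree are exported as conflicts for adjudication. This sketch assumes each review is a mapping from a sample id to a label; that record format and the `merge_reviews` helper are hypothetical, and the actual workflow is the one in [cot_eval/rq1.md](cot_eval/rq1.md).

```python
def merge_reviews(review_a: dict[str, str], review_b: dict[str, str]):
    """Split doubly-annotated samples into agreed labels and conflicts."""
    merged, conflicts = {}, []
    # Only samples labeled by both annotators can be compared.
    for sample_id in sorted(review_a.keys() & review_b.keys()):
        label_a, label_b = review_a[sample_id], review_b[sample_id]
        if label_a == label_b:
            merged[sample_id] = label_a
        else:
            conflicts.append({"id": sample_id, "a": label_a, "b": label_b})
    return merged, conflicts


if __name__ == "__main__":
    # Toy labels for illustration only.
    a = {"s1": "correct", "s2": "wrong", "s3": "correct"}
    b = {"s1": "correct", "s2": "correct"}
    merged, conflicts = merge_reviews(a, b)
    print(merged)     # {'s1': 'correct'}
    print(conflicts)  # [{'id': 's2', 'a': 'wrong', 'b': 'correct'}]
```

Samples seen by only one annotator (`s3` here) are left out of both results, so they can be routed to a second review rather than silently accepted.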

## Troubleshooting

- Image/path issues
  - Ensure the preprocessed directory structure matches script expectations. Paths are resolved per the logic in [cot_eval/annot_ui_min.py](cot_eval/annot_ui_min.py).
- Missing/failed CoT parsing
  - The tool reports error codes (e.g., `E_COT_MISSING`, `E_COT_PARSE_FAIL`; see [cot_eval/annot_ui_min.py](cot_eval/annot_ui_min.py)). First check that the model outputs comply with the schema.
- Memory/performance
  - Reduce batch size, disable unnecessary long reasoning/history concatenation, or use a smaller model. If using DeepSpeed/Accelerate, tune for your hardware.

## Repro roadmap (quick)

1) Download models into model/ per [model/README.md](model/README.md)  
2) Prepare evaluation data in eval/eval_data/ per [eval/eval_data/README.md](eval/eval_data/README.md)  
3) Run inference and EM evaluation per [eval/README_TARS.md](eval/README_TARS.md) and/or [eval/README_OWL.md](eval/README_OWL.md)  
4) Run CoT inference and GTA evaluation per [eval/README_COT.md](eval/README_COT.md)  
5) (Optional) Manual annotation + automated evaluation per [cot_eval/rq1.md](cot_eval/rq1.md)  
6) Quadrant analysis: `python cot_eval/cot_analysis.py`  
7) Figures for paper: see `cot_eval/figures/`
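
One way to read the quadrant analysis in step 6 is as a two-by-two table that crosses, per step, whether the reasoning was judged correct (GTA) with whether the executed action matched (EM). The sketch below assumes that reading; the quadrant labels and the `quadrant_counts` helper are illustrative and are not taken from `cot_eval/cot_analysis.py`.

```python
from collections import Counter

# Illustrative labels for the four (reasoning_ok, execution_ok) combinations.
QUADRANTS = {
    (True, True): "reasoning ok / execution ok",
    (True, False): "reasoning ok / execution wrong",
    (False, True): "reasoning wrong / execution ok",
    (False, False): "reasoning wrong / execution wrong",
}


def quadrant_counts(steps):
    """Count steps per quadrant from (reasoning_ok, execution_ok) pairs."""
    return Counter(QUADRANTS[(bool(r), bool(e))] for r, e in steps)


if __name__ == "__main__":
    # Toy input for illustration only.
    toy_steps = [(True, True), (True, False), (True, False), (False, False)]
    for label, count in quadrant_counts(toy_steps).items():
        print(f"{label}: {count}")
```

Under this reading, the two off-diagonal quadrants are the ones that expose a gap between what the model reasons and what it executes.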

## Acknowledgements

This repo builds on multiple open-source models and datasets. See the subdirectory READMEs and upstream licenses for details.