## Overview

This directory contains the complete training pipeline used in the paper, including:

- `preprocess/`: preprocessing raw data and constructing **general / trap** training data;
- `sft/`: supervised fine-tuning (SFT) on instruction-following data;
- `EasyR1/`: reinforcement fine-tuning (RFT) based on `faithful_grpo`;
- `evaluate/`: evaluation on the unified test set (TMR / AMR, episode success rate) after SFT or RFT;
- `preprocess/baseline/`: conversion and evaluation for comparison with OS-Atlas, GUI-OWL, UI-TARS, etc.

Below we briefly describe the logic and key hyper‑parameters of each stage.

---

## 1. Data preprocessing (`preprocess/`)

For detailed commands, see `preprocess/README.md`. Here we summarize the main logic:

- **Step 1: Process raw datasets**
  - Use `preprocess/data/process_ac.py` and `process_aitz.py` to convert the raw **Android Control** and **AITZ** datasets into a unified JSON format.

- **Step 2: Build general / trap splits**
  - `preprocess/create_general_data.py`:
    - Randomly samples from multiple JSON sources to build the **general** split (`data_type = 0`), with a target scale of roughly `1000 test + 4000 train` general samples.
  - `preprocess/sample_for_trap.py`:
    - Filters **click actions** (`gt_action.action == "click"`) from the original JSON files and samples about `700 test + 2000 train` click examples as **candidates** for trap construction.
  - `preprocess/create_trap_data.py`:
    - Based on the sampled click examples, constructs the **trap** split (`data_type = 2`), containing three types of perturbations:
      - **mask**: mask regions of the image;
      - **inpaint**: edit regions via OpenCV inpainting;
      - **instruction_modify**: modify the natural language instruction using a large vision-language model.
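
The click-filtering and sampling step can be sketched as follows. This is a minimal illustration, not the code in `sample_for_trap.py`: the function name, the toy records, and the seed handling are assumptions; only the `gt_action.action == "click"` filter comes from the description above.

```python
import random

def sample_click_candidates(samples, n_test, n_train, seed=0):
    """Keep only click samples, then draw disjoint test/train candidate sets."""
    clicks = [s for s in samples if s.get("gt_action", {}).get("action") == "click"]
    rng = random.Random(seed)
    rng.shuffle(clicks)
    # Never request more samples than are available.
    n_test = min(n_test, len(clicks))
    n_train = min(n_train, len(clicks) - n_test)
    return clicks[:n_test], clicks[n_test:n_test + n_train]

# Toy records (hypothetical); the real unified JSON carries more fields.
toy = [
    {"id": i, "gt_action": {"action": "click" if i % 2 == 0 else "swipe"}}
    for i in range(20)
]
test_set, train_set = sample_click_candidates(toy, n_test=3, n_train=5)
print(len(test_set), len(train_set))
```

Shuffling once and slicing keeps the two candidate sets disjoint, so a trap example built from a test candidate cannot leak into training.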

- **Step 3: Annotation and formatting**
  - `preprocess/annotate_general/annotate.py` and `preprocess/annotate_trap/annotate.py`:
    - Use vLLM + Qwen3-VL to batch‑annotate **Thought / Action / <tool_call>** for each sample.
  - `preprocess/format_for_training.py`:
    - Converts annotated results into the same JSON schema as the released `faithful_dataset` (including `prompt_str`, `answer_str`, `data_type`, etc.), and merges general + trap into the final train / test splits.
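
As a rough sketch of the formatting and merging step, the snippet below maps annotated samples onto the fields named above and tags them with their `data_type`. The input keys (`prompt`, `annotation`) and both helper names are hypothetical; the actual schema is defined by `format_for_training.py`.

```python
import random

def to_training_record(sample, data_type):
    """Map one annotated sample onto the released dataset's field names."""
    return {
        "prompt_str": sample["prompt"],      # hypothetical input key
        "answer_str": sample["annotation"],  # hypothetical input key
        "images": sample["images"],
        "data_type": data_type,              # 0 = general, 2 = trap
    }

def merge_splits(general, trap, seed=0):
    """Merge general and trap samples into one shuffled list of records."""
    records = [to_training_record(s, 0) for s in general]
    records += [to_training_record(s, 2) for s in trap]
    random.Random(seed).shuffle(records)
    return records

general = [{"prompt": "open settings", "annotation": "Thought: ...", "images": ["a.png"]}]
trap = [{"prompt": "tap the hidden button", "annotation": "Thought: ...", "images": ["b.png"]}]
merged = merge_splits(general, trap)
print([r["data_type"] for r in merged])
```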

---

## 2. SFT stage (`sft/`)

### Config: `sft/sft.yaml`

Key settings (paths shown are placeholders):

- **model**
  - `model_name_or_path: /MODEL_PATH`: base multimodal model (e.g., Qwen3-VL).
  - `trust_remote_code: true`.

- **method**
  - `stage: sft`, `finetuning_type: full`: full‑parameter supervised fine‑tuning.
  - `deepspeed: /DEEPSPEED_CONFIG_PATH`: DeepSpeed configuration path.

- **dataset**
  - `dataset: rft_train_llamafactory_dt0`: only **general** data (`data_type = 0`) is used for SFT.
  - `dataset_dir: /DATASET_DIR`: SFT data directory (produced by the preprocessing stage).
  - `template: qwen3_vl_nothink`: chat template.
  - `cutoff_len: 8192`, `preprocessing_num_workers: 256`.

- **output & train**
  - `output_dir: /OUTPUT_DIR`
  - `per_device_train_batch_size: 4`
  - `gradient_accumulation_steps: 8`
  - `learning_rate: 1e-5`
  - `num_train_epochs: 3.0`
  - `lr_scheduler_type: cosine`
  - `warmup_ratio: 0.1`
  - `bf16: true`
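
For orientation, the effective batch size per optimizer step follows from the two batch settings above and the number of GPUs. The GPU counts below are assumptions for illustration; the README does not state how many devices were used.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples consumed per optimizer step under data-parallel training."""
    return per_device_batch * grad_accum_steps * num_gpus

# per_device_train_batch_size=4 and gradient_accumulation_steps=8 come from sft.yaml;
# the GPU counts are hypothetical.
for num_gpus in (1, 4, 8):
    print(num_gpus, effective_batch_size(4, 8, num_gpus))
```

If you change the number of GPUs, adjusting `gradient_accumulation_steps` is one way to keep this product, and hence the learning-rate behaviour, roughly comparable.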

### Launcher: `sft/run_sft.sh`

Thin wrapper around LLaMA‑Factory CLI:

- Sets `CUDA_VISIBLE_DEVICES`, `CONFIG_PATH` (the `sft.yaml` above), `LLAMA_FACTORY_DIR`, etc.;
- Launches:

```bash
DISABLE_VERSION_CHECK=1 FORCE_TORCHRUN=1 \
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
llamafactory-cli train ${CONFIG_PATH}
```

---

## 3. RFT stage: `faithful_grpo` (in `EasyR1/`)

RFT is mainly configured by:

- `EasyR1/examples/faithful_grpo.sh`
- `EasyR1/examples/faithful_grpo_config.yaml`
- `EasyR1/examples/reward_function/faithful_grpo.py`

### 3.1 Launcher: `faithful_grpo.sh`

Main logic:

- Sets:
  - `MODEL_PATH`: initialization for RFT (usually the SFT checkpoint);
  - `CONFIG_PATH`: path to `faithful_grpo_config.yaml`;
  - `TRAIN_DATASET_PATH` / `VAL_DATASET_PATH`: mixed general + trap data produced by the preprocessing stage (final output of `preprocess/format_for_training.py`);
  - `CHECKPOINT_DIR`: where RFT checkpoints are saved;
  - `CUDA_VISIBLE_DEVICES`, `N_GPUS_PER_NODE`, etc.

- Calls:

```bash
python3 -m verl.trainer.main \
  config=${CONFIG_PATH} \
  worker.actor.model.model_path=${MODEL_PATH} \
  data.train_files=${TRAIN_DATASET_PATH} \
  data.val_files=${VAL_DATASET_PATH} \
  ...
```

### 3.2 Key config: `faithful_grpo_config.yaml`

Only the fields most relevant to **faithfulness / reward / adv_estimator** are highlighted here:

- **data**
  - `train_files / val_files`: training / validation JSON files;
  - `prompt_key: prompt_str`, `answer_key: answer_str`, `image_key: images`;
  - `rollout_batch_size: 384`, `val_batch_size: 1024`;
  - `min_pixels / max_pixels`: lower / upper limits on image resolution.

- **algorithm**
  - `adv_estimator: grpo_anchor_var_temp`:
    - GRPO with **anchor + variance tempering** for advantage estimation;
    - Typically more stable than vanilla PPO/GRPO under noisy rewards (e.g., GUI vision tasks).
  - `disable_kl: false`, `use_kl_loss: true`;
  - `kl_penalty: low_var_kl`, `kl_coef: 1e-2`:
    - KL penalty to keep the policy close to a reference model;
    - Larger `kl_coef` → more conservative; smaller → more exploratory.
  - `online_filtering: false` / `filter_*`:
    - Online sample filtering by reward range; disabled in this configuration.

- **worker.actor**
  - `global_batch_size: 192`
  - `micro_batch_size_per_device_for_update: 2`
  - `micro_batch_size_per_device_for_experience: 4`
  - `model.enable_gradient_checkpointing: true`
  - `optim.lr: 1e-6`, `weight_decay: 1e-2`

- **reward**
  - `reward_function: ./examples/reward_function/faithful_grpo.py:compute_score`
  - `reward_function_kwargs`:
    - `action_match_weight: 0.85`
    - `consistency_weight: 0.15`
    - `click_threshold: 140.0`
    - `use_consistency_reward: true`
    - `use_continuous_reward: true`
    - `click_tau: 60.0`

These map directly to the implementation in `faithful_grpo.py`.

### 3.3 Reward function `faithful_grpo.py` (core idea)

`compute_score` returns, for each sample:

- **Action‑match score** in [0, 1]
  - First parses the model output into a Qwen3‑style action via `parse_model_output_to_qwen3`;
  - Then performs **parameter‑level** comparison depending on action type:
    - `click` / `long_press`: use click distance `distance` and map to reward via `exp(-distance / click_tau)`;
    - `type` / `answer`: compute normalized edit distance between predicted and ground‑truth text;
    - `swipe`: compare swipe direction, then magnitude of the displacement;
    - `system_button` / `terminate`: compare button / status exactly.
  - If parameters cannot be reliably parsed, it falls back to a type‑only reward (0 or 0.3).

- **Thought‑Action consistency score** in [0, 1]
  - Parses `Thought:` and `Action:` from the output;
  - Uses rich keyword sets (e.g., “scroll”, “search”, “back”, “terminate”, etc.) to infer **intent** from the Thought, and checks whether the Action is semantically consistent:
    - If Thought says “scroll down to see more items” and Action is a downward swipe, give high reward;
    - If Thought says “go back to the previous screen” but Action clicks on content, penalize the mismatch with a low score.

- **Overall reward**
  - `overall = action_match_weight * action_match + consistency_weight * consistency`
  - In the config we use `0.85 / 0.15`; the default in code is `0.9 / 0.1`. The config values take precedence.
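
The pieces above can be sketched in a few lines. This is an illustrative approximation, not the code in `faithful_grpo.py`: the helper names are hypothetical, the edit distance is normalized by the longer string (one plausible choice), and the interaction between `click_threshold` and the continuous reward is not modelled.

```python
import math

def click_reward(pred_xy, gt_xy, click_tau=60.0):
    """Continuous click reward: exp(-distance / click_tau)."""
    distance = math.dist(pred_xy, gt_xy)
    return math.exp(-distance / click_tau)

def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def text_reward(pred, gt):
    """1 minus the edit distance normalized by the longer string (assumed)."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, gt) / longest

def overall_reward(action_match, consistency,
                   action_match_weight=0.85, consistency_weight=0.15):
    """Weighted sum of the two scores, using the config weights as defaults."""
    return action_match_weight * action_match + consistency_weight * consistency

# How click_tau changes the reward for a click that is 60 px off target.
for tau in (30.0, 60.0, 120.0):
    print(tau, round(click_reward((0, 0), (60, 0), click_tau=tau), 3))

print(round(overall_reward(click_reward((0, 0), (60, 0)), 1.0), 3))
```

A larger `click_tau` flattens the exponential, so the same pixel error costs less reward; this is the tolerance effect that `click_tau` controls.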

> **Tuning guidelines**
> - To emphasize **correct actions**, increase `action_match_weight` (e.g., use `0.9 / 0.1`);
> - To emphasize **reasoning consistency**, increase `consistency_weight`;
> - `click_threshold` and `click_tau` jointly control the tolerance to click errors:
>   - Smaller threshold / tau → stricter click precision;
>   - Larger threshold / tau → more tolerant to coordinate deviations.

---

## 4. Evaluation on test set (`evaluate/`)

After RFT (or SFT), we evaluate the model on the same test JSON produced by preprocessing (mixed general + trap, same schema as `val_files`). The evaluation pipeline under `evaluate/` runs **offline inference** and computes **TMR / AMR** and episode-level success rate, using the same action space and matching rules as the reward function.

### 4.1 Main script: `evaluate/evaluate_test.py`

- **Input**
  - `--test_json_path`: path to the test JSON (e.g. the same file as `VAL_DATASET_PATH` or a dedicated test split);
  - `--model_path`: checkpoint to evaluate (SFT or RFT);
  - `--base_model_path` (optional): if set, config is loaded from here and weights from `model_path` (for adapters / merged checkpoints).

- **Inference**
  - Loads the model with HuggingFace `AutoModelForImageTextToText` + `AutoProcessor` (Qwen3-VL style);
  - Builds the chat input from each sample's `messages` and `images` fields, then runs batched `model.generate` with `do_sample=False`;
  - Supports **multi-GPU, multi-process** evaluation: `--device_ids` and `--agent_count` split episodes across processes, and each process loads one model replica.

- **Action parsing and matching**
  - Uses `evaluate/qwen3_action_mapper.py`:
    - `parse_model_output_to_qwen3(response)` parses the model output into a Qwen3-style action dict;
    - `is_qwen3_action_type_match(pred, gt)` checks action type;
    - `is_qwen3_action_match(pred, gt, click_threshold)` checks type + parameters (click distance ≤ `click_threshold`, text edit distance, swipe direction, etc.).
  - `click_threshold` defaults to **140.0**, consistent with `faithful_grpo.py` and the baseline evaluation.

- **Metrics**
  - **TMR (Type Match Rate)**: fraction of samples where the predicted action **type** matches the ground truth;
  - **AMR (Action Match Rate)**: fraction of samples where the full action (type + parameters) is correct (e.g. click within threshold, correct text, etc.);
  - Metrics are computed **overall** and **by `data_type`** (e.g. general vs trap), and written to:
    - `--result_save_path`: full results + metrics (default: `evaluation_results.json`);
    - `*_statistics_by_datatype.json`: summary statistics per data type.
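
To make the two metrics concrete, here is a simplified sketch of how TMR and AMR could be aggregated overall and per `data_type`. The matching helpers are stand-ins for `is_qwen3_action_type_match` / `is_qwen3_action_match`, and the `coordinate` field and toy results are assumptions; only the click rule (distance ≤ `click_threshold`) follows the description above.

```python
import math
from collections import defaultdict

def is_type_match(pred, gt):
    return pred.get("action") == gt.get("action")

def is_action_match(pred, gt, click_threshold=140.0):
    """Type must match; for clicks the point must lie within click_threshold."""
    if not is_type_match(pred, gt):
        return False
    if gt["action"] == "click":
        return math.dist(pred["coordinate"], gt["coordinate"]) <= click_threshold
    # Other action types would compare text, direction, etc.; simplified here.
    return pred == gt

def compute_metrics(results, click_threshold=140.0):
    """Return TMR / AMR overall and per data_type."""
    counts = defaultdict(lambda: {"n": 0, "type": 0, "action": 0})
    for r in results:
        for key in ("overall", r["data_type"]):
            c = counts[key]
            c["n"] += 1
            c["type"] += is_type_match(r["pred"], r["gt"])
            c["action"] += is_action_match(r["pred"], r["gt"], click_threshold)
    return {k: {"TMR": c["type"] / c["n"], "AMR": c["action"] / c["n"]}
            for k, c in counts.items()}

results = [  # hypothetical predictions
    {"data_type": 0, "pred": {"action": "click", "coordinate": [100, 100]},
     "gt": {"action": "click", "coordinate": [150, 100]}},
    {"data_type": 0, "pred": {"action": "click", "coordinate": [0, 0]},
     "gt": {"action": "click", "coordinate": [300, 0]}},
    {"data_type": 2, "pred": {"action": "swipe", "direction": "down"},
     "gt": {"action": "click", "coordinate": [10, 10]}},
    {"data_type": 2, "pred": {"action": "terminate", "status": "success"},
     "gt": {"action": "terminate", "status": "success"}},
]
metrics = compute_metrics(results)
print(metrics)
```

Because AMR requires a type match first, it can never exceed TMR on the same subset.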

### 4.2 Usage example

```bash
python evaluate/evaluate_test.py \
  --test_json_path /PATH_TO_TEST_JSON \
  --result_save_path /PATH_TO_OUTPUT/evaluation_results.json \
  --model_path /PATH_TO_SFT_OR_RFT_CHECKPOINT \
  --device_ids "[0,1,2,3]" \
  --agent_count 4 \
  --batch_size 8 \
  --click_threshold 140.0
```
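
The `--device_ids` / `--agent_count` flags split episodes across worker processes. How `evaluate_test.py` assigns episodes is not described here; the sketch below assumes a simple round-robin split and a hypothetical `shard_episodes` helper, purely to illustrate the idea.

```python
def shard_episodes(episode_ids, device_ids, agent_count):
    """Round-robin episodes over agent_count workers; workers cycle over devices."""
    shards = [
        {"device": device_ids[worker % len(device_ids)], "episodes": []}
        for worker in range(agent_count)
    ]
    for index, episode in enumerate(episode_ids):
        shards[index % agent_count]["episodes"].append(episode)
    return shards

shards = shard_episodes(list(range(10)), device_ids=[0, 1, 2, 3], agent_count=4)
for shard in shards:
    print(shard["device"], shard["episodes"])
```

Splitting at the episode level keeps all steps of one episode in the same process, which simplifies episode-level success accounting.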

This step is **independent of baseline conversion**: it directly evaluates our model on the unified test set. The baseline evaluation (Section 5) then takes care of converting results into formats required by OS-Atlas, GUI-OWL, UI-TARS, etc., for comparison with other methods.

---

## 5. Baseline evaluation logic (`preprocess/baseline/`)

To compare with existing GUI baselines, we provide a set of **result conversion + evaluation scripts** under `preprocess/baseline/`:

- `run_preprocess.py`:
  - Takes unified JSON prediction files (with model actions, etc.) and converts them into the formats required by OS‑Atlas, GUI‑OWL, UI‑TARS, and other benchmarks.

- `utils/result_preprocess.py`:
  - Defines a family of `*_RES_PRE_PROCESS` classes to parse and analyze outputs from different models;
  - Converts them into a shared internal action space (`CLICK / TYPE / SCROLL / PRESS_* / COMPLETE / ...`), and computes metrics based on:
    - whether the **action type** matches;
    - whether the **click distance** is below 140;
    - whether **text** matches;
    - whether **scroll direction** matches.
  - Aggregates statistics such as TMR, AMR, and TSR.
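
As an illustration of converting model-specific outputs into a shared action space, the sketch below maps Qwen3-style action names onto the shared names listed above. The mapping table, the `UNKNOWN` fallback, and the function name are assumptions; the actual `*_RES_PRE_PROCESS` classes may structure this differently.

```python
# Hypothetical name mapping; the real *_RES_PRE_PROCESS classes may differ.
SHARED_ACTION_TYPES = {
    "click": "CLICK",
    "type": "TYPE",
    "swipe": "SCROLL",
    "terminate": "COMPLETE",
}

def to_shared_action(action):
    """Convert one Qwen3-style action dict into the shared action space."""
    name = action.get("action")
    if name == "system_button":
        # PRESS_* variants are derived from the button name, e.g. PRESS_BACK.
        return {"type": "PRESS_" + str(action.get("button", "")).upper()}
    shared = {"type": SHARED_ACTION_TYPES.get(name, "UNKNOWN")}
    # Carry the remaining parameters over unchanged for later comparison.
    shared.update({k: v for k, v in action.items() if k != "action"})
    return shared

print(to_shared_action({"action": "click", "coordinate": [120, 340]}))
print(to_shared_action({"action": "system_button", "button": "back"}))
```

Normalizing every model's output to one action vocabulary lets the same type, click-distance, text, and scroll-direction checks be applied to all baselines.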


