# Data Preprocessing Pipeline

## Overview

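This directory contains scripts that process the raw Android Control (AC) and AITZ datasets, build general and trap data splits, annotate them, and convert the result to the training format.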

## Workflow

### 1. Process Raw Data

First, process the raw datasets in the `data/` subdirectory:

```bash
# Process Android Control dataset
cd data
python process_ac.py \
    --input_dir /INPUT_DIR \
    --output_dir /OUTPUT_DIR \
    --splits_file /SPLITS_FILE

# Process AITZ dataset
python process_aitz.py \
    --input_dir /INPUT_DIR \
    --output_dir /OUTPUT_DIR \
    --image_key image_path
```

Each script writes its processed files to the given `--output_dir`. Keep track of these paths, because the later steps take the processed files as input.

### 2. Create General and Trap Splits

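Generate the general and trap data splits. The commands below are run from this directory, so return here first (`cd ..`) if you are still in `data/` after step 1: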

```bash
# Create general data split (data_type=0)
# Note: this generates only the general data, not the full dataset
# Expected: ~1000 test + ~4000 train = ~5000 general samples
python create_general_data.py \
    --input_files /INPUT_FILE_1 /INPUT_FILE_2 /INPUT_FILE_3 /INPUT_FILE_4 \
    --output_dir /OUTPUT_DIR \
    --test_size 1000 \
    --train_size 4000 \
    --random_seed 42

# Sample click actions for trap data generation
# Expected: ~700 test + ~2000 train = ~2700 click samples
python sample_for_trap.py \
    --input_files /INPUT_FILE_1 /INPUT_FILE_2 \
    --output_dir /OUTPUT_DIR \
    --test_size 700 \
    --train_size 2000 \
    --random_seed 42

# Create trap data split from sampled click actions
python create_trap_data.py \
    --input /SAMPLED_CLICK_INPUT_FILE \
    --output /OUTPUT_FILE \
    --output_image_dir /OUTPUT_IMAGE_DIR \
    --mask_ratio 0.35 \
    --inpaint_ratio 0.35 \
    --instruction_ratio 0.3 \
    --model_path /MODEL_PATH \
    --batch_size 512 \
    --device_ids "[0,1,2,3,4,5,6,7]" \
    --tensor_parallel_size 8 \
    --random_seed 42
```
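The three ratio flags passed to `create_trap_data.py` sum to 1.0 (0.35 + 0.35 + 0.3). If they partition the sampled click actions among three trap types, which is an assumption about the script and not something stated here, the per-type counts can be estimated with a minimal sketch like the one below. The helper name `allocate` and the labels `mask`, `inpaint`, and `instruction` are hypothetical and only mirror the flag names.

```python
import math

# Ratios copied from the create_trap_data.py command above.
# The labels are illustrative; the script's internal names may differ.
RATIOS = {"mask": 0.35, "inpaint": 0.35, "instruction": 0.3}


def allocate(total, ratios):
    """Hypothetical helper: divide `total` samples according to `ratios`."""
    if not math.isclose(sum(ratios.values()), 1.0):
        raise ValueError("ratios must sum to 1.0")
    names = list(ratios)
    # Round every category except the last one.
    counts = {name: round(total * ratios[name]) for name in names[:-1]}
    # The last category takes the remainder so the counts always sum to total.
    counts[names[-1]] = total - sum(counts.values())
    return counts


if __name__ == "__main__":
    # Sample sizes taken from the sample_for_trap.py command above.
    for split, size in (("train", 2000), ("test", 700)):
        print(split, allocate(size, RATIOS))
```

Checking the ratios before launching an 8-GPU job is cheap, and the estimated counts give a rough figure to compare against the actual output.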

### 3. Annotate Data

Annotate the general and trap data splits:

```bash
# Annotate general data
cd annotate_general
bash annotate_all.sh

# Annotate trap data
cd ../annotate_trap
bash annotate_all.sh
```

### 4. Format for Training

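Convert the annotated data to the training format. `format_for_training.py` lives in this directory, so return here first (`cd ..`) if you are still in `annotate_trap/` after step 3. The final dataset should contain both general (data_type=0) and trap (data_type=2) data: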

```bash
# Format general annotated data (data_type=0)
python format_for_training.py \
    --input_file /GENERAL_ANNOTATED_INPUT_FILE \
    --output_file /GENERAL_TRAINING_FORMAT_OUTPUT_FILE \
    --data_type 0

# Format trap annotated data (data_type=2)
python format_for_training.py \
    --input_file /TRAP_ANNOTATED_INPUT_FILE \
    --output_file /TRAP_TRAINING_FORMAT_OUTPUT_FILE \
    --data_type 2

# Merge general and trap data to create final training/test sets
# Expected distribution:
# - Train: ~4000 general (data_type=0) + ~2000 trap (data_type=2) = ~6000 total
# - Test: ~1000 general (data_type=0) + ~700 trap (data_type=2) = ~1700 total
# You can use a simple script to combine the two JSON files
```
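The merge step mentioned in the comments above could look like the following minimal sketch. It assumes that each formatted file is a JSON array of sample objects and that each sample carries a `data_type` field; neither detail is confirmed here, so adjust the loader if the real format differs (for example, JSON Lines). The function names and file paths are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Load a JSON file assumed to hold a top-level list of samples."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a JSON list of samples")
    return records


def merge_files(input_paths, output_path):
    """Concatenate the samples from all input files and write them out."""
    merged = []
    for path in input_paths:
        merged.extend(load_records(path))
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return merged


def count_by_type(records, key="data_type"):
    """Count samples per data_type (the field name is an assumption)."""
    return Counter(record.get(key) for record in records)


# Example usage with placeholder paths:
# train = merge_files(["general_train.json", "trap_train.json"], "train.json")
# print(count_by_type(train))  # compare against the expected distribution
```

The sketch only concatenates the files and does not shuffle the samples. Comparing the per-type counts with the expected distribution is a quick way to catch a missing or duplicated input file.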

## Directory Structure

- `data/`: Scripts for processing raw datasets (AC and AITZ)
- `annotate_general/`: Scripts for annotating general data splits
- `annotate_trap/`: Scripts for annotating trap data splits
- `utils/`: Utility modules (e.g., `qwen3_mobile_use.py`)
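- `create_general_data.py`: Script to create the general data split (data_type=0)
- `sample_for_trap.py`: Script to sample click actions for trap data generation
- `create_trap_data.py`: Script to create the trap data split from the sampled click actions
- `format_for_training.py`: Script to convert annotated data to training format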
