# Hi-Agent

## Key Features

- Hierarchical architecture with dual-model design:
  - High-level reasoning model for semantic subgoal generation
  - Low-level action model for precise execution
- Comprehensive action visualization and logging capabilities
- Support for various action types:
  - Click actions
  - Swipe gestures
  - Text input
  - System button interactions
  - Task completion status
- Discrete-time screen capture and action visualization
- Detailed logging of reasoning processes and actions
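
The dual-model design listed above can be pictured as a simple control loop: the reasoning model proposes a subgoal from the current screen, and the action model turns that subgoal into one concrete action. The sketch below is illustrative only and is not taken from `agent.py`. The `Action` record, the `run_episode` function, and all callables passed to it are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical record covering the action types listed above."""
    kind: str  # "click", "swipe", "type", "system_button", or "status"
    point: Optional[Tuple[int, int]] = None      # click target or swipe start
    end_point: Optional[Tuple[int, int]] = None  # swipe end
    text: Optional[str] = None                   # typed text, button name, or status


def run_episode(
    task: str,
    get_screenshot: Callable[[], bytes],
    reason: Callable[[str, bytes, list], str],   # stands in for the high-level model
    act: Callable[[str, bytes], Action],         # stands in for the low-level model
    execute: Callable[[Action], None],           # stands in for the device interface
    max_steps: int = 10,
) -> List[Tuple[str, Action]]:
    """Alternate subgoal generation and action prediction until the task ends."""
    history: List[Tuple[str, Action]] = []
    for _ in range(max_steps):
        screenshot = get_screenshot()
        subgoal = reason(task, screenshot, history)  # semantic subgoal
        action = act(subgoal, screenshot)            # concrete action for that subgoal
        history.append((subgoal, action))            # keep a log of reasoning and actions
        if action.kind == "status":                  # the agent reports task completion
            break
        execute(action)
    return history


if __name__ == "__main__":
    # Stub models so the loop can run without a device or any checkpoints.
    scripted = iter([
        Action("click", point=(100, 200)),
        Action("type", text="weather"),
        Action("status", text="complete"),
    ])
    log = run_episode(
        task="Search for the weather",
        get_screenshot=lambda: b"",
        reason=lambda task, shot, hist: f"subgoal {len(hist) + 1} for: {task}",
        act=lambda subgoal, shot: next(scripted),
        execute=lambda action: None,
    )
    for subgoal, action in log:
        print(subgoal, "->", action.kind)
```

Keeping the two models behind plain callables makes it possible to swap either one, or to replace both with stubs when testing the loop itself.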

## Project Structure

```
Hi-Agent/
├── agent.py                    # Main agent implementation with dual-model architecture
├── env.py                      # Android environment interface and control
├── evaluate.py                 # Evaluation scripts for model performance
├── autoui_utils.py             # UI automation utilities
├── utils/                      # Utility functions
│   └── agent_function_call.py  # Function calling implementation for the agent
├── ENV_README.md              # Environment setup guide
├── requirements.txt           # Python package dependencies
└── README.md                  # Project documentation
```

The project is organized as follows:
- `agent.py`: Core implementation of HiAgent, including both the high-level reasoning model and low-level action model.
- `env.py`: Implementation of the Android environment interface, handling device control and state management.
- `evaluate.py`: Scripts for evaluating model performance.
- `autoui_utils.py`: Utilities for UI automation and interaction.
- `utils/`: Directory containing helper modules.
  - `agent_function_call.py`: Implementation of the function-calling mechanism used by the agent.
- `ENV_README.md`: Guide to environment setup and configuration.
- `requirements.txt`: Python package dependencies.
- `README.md`: Guide to model setup and configuration.

## Quick Start
### Dependencies

First, create a [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) environment and install all pip package requirements.

```bash
conda create -n hiagent python=3.10
conda activate hiagent
pip install -r requirements.txt
```

### Environment Setup

To set up the Android environment that Hi-Agent interacts with, refer to [the environment README](./ENV_README.md).

### Model Checkpoints

1. The trained reasoning model is hosted in an anonymous Hugging Face repository (https://huggingface.co/HiAgent/Hi-Agent-Model). Download it locally and set `reason_model_path` in `agent.py` to the download location.

2. For the action model, you can use Qwen2.5-VL-3B-Instruct directly; it performed well as the action model in our tests. Download it locally and set `function_model_path` in `agent.py` to the download location.
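
One possible way to fetch both checkpoints is the `snapshot_download` function from the `huggingface_hub` package. This is a sketch under assumptions, not the project's own download script: it assumes `huggingface_hub` is installed, takes the reasoning-model repository ID from the URL above, assumes the action model is published as `Qwen/Qwen2.5-VL-3B-Instruct` on the Hub, and uses a hypothetical `checkpoints/` directory as the download root.

```python
from pathlib import Path

# Repository IDs on the Hugging Face Hub. The first comes from the URL above;
# the second is assumed to be the official Qwen release of the action model.
CHECKPOINTS = {
    "reason_model": "HiAgent/Hi-Agent-Model",
    "action_model": "Qwen/Qwen2.5-VL-3B-Instruct",
}


def local_dir_for(repo_id: str, root: str = "checkpoints") -> Path:
    """Map a repository ID to a local folder (hypothetical layout)."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root: str = "checkpoints") -> dict:
    """Download every checkpoint and return the local path of each one."""
    # Imported here so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    paths = {}
    for name, repo_id in CHECKPOINTS.items():
        target = local_dir_for(repo_id, root)
        paths[name] = snapshot_download(repo_id=repo_id, local_dir=str(target))
    return paths
```

Calling `download_all()` requires network access and enough disk space for both models. The returned paths can then be copied into the model path settings in `agent.py`.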


### Configuration Setup

Set the following parameters in `agent.py`:

```python
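# Android environment configuration
"android_avd_home": "/home/xxx/.android/avd",  # Directory containing Android Virtual Devices
"emulator_path": "/home/xxx/.android/emulator/emulator",  # Path to the emulator executable
"adb_path": "/home/xxx/.android/platform-tools/adb",  # Path to the adb executable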

# Model and data paths
"assets_path": "/home/xx/assets/task_set",  # Path to task dataset
"reason_model_path": "/home/xx/Hi-Agent",  # Path to high-level reasoning model
"function_model_path": "/home/xx/Qwen2.5-VL-3B-Instruct",  # Path to low-level action model

# API configuration
"gemini_key": "xx",  # Gemini API key

# Task configuration
"task_set": "general",  # Task set to use (general/webshopping)
"task_split": "train"   # Data split to use (train/test)
```
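
Wrong paths are a common cause of startup failures, so it can help to check the settings above before launching the emulator. The following is a minimal sketch, assuming the settings live in a plain Python dictionary. `check_config`, `PATH_KEYS`, and `ALLOWED` are hypothetical names for this example and are not part of `agent.py`. The `gemini_key` entry is not checked.

```python
import os

# Keys from the configuration above whose values should exist on disk.
PATH_KEYS = [
    "android_avd_home",
    "emulator_path",
    "adb_path",
    "assets_path",
    "reason_model_path",
    "function_model_path",
]

# Allowed values taken from the comments in the configuration above.
ALLOWED = {
    "task_set": {"general", "webshopping"},
    "task_split": {"train", "test"},
}


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for key in PATH_KEYS:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: missing")
        elif not os.path.exists(value):
            problems.append(f"{key}: path does not exist: {value}")
    for key, allowed in ALLOWED.items():
        if config.get(key) not in allowed:
            problems.append(
                f"{key}: expected one of {sorted(allowed)}, got {config.get(key)!r}"
            )
    return problems
```

A caller would typically print each returned problem and stop before loading the models if the list is not empty.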

### Running Experiments

After modifying the configuration to your preferences, you can run experiments using the following command:

```bash
python agent.py > log_dual_test.txt
```

The model's reasoning process and the resulting actions are written to `log_dual_test.txt`. The code also saves a screenshot after each action to the specified `save_path`.

## License

All content in this work, including the codebase, data, and model checkpoints, is released under the Apache License 2.0.

## Acknowledgments

We would like to express our sincere gratitude to the following open-source projects and communities:

- [Android-in-the-Wild (AitW)](https://github.com/google-research/android-in-the-wild) for providing the benchmark task set
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) for the powerful vision-language model
- [DigiRL](https://github.com/DigiRL-agent/digirl) for their pioneering work in mobile device control

We also thank all contributors and the open-source community for their continuous support and inspiration.

