<h1 align="center">WebCanvas: Benchmarking Web Agents in Online Environments</h1>


![Main Figure](src/main_figure.png)


## 🔍 Evaluation on Existing WebCanvas Benchmarks

### Setting Up the Environment

First, ensure your environment is ready by installing the necessary dependencies:

```bash 
conda create -n webcanvas python=3.11
conda activate webcanvas
pip install -r requirements.txt
```

Before running the evaluation, set up the API keys required by the features that depend on external APIs. Please refer to these [docs](agent/LLM/README.md).

#### Recommended Environment for Mind2Web-Live

In our experiments, the runtime environment had a significant effect on agent performance. We recommend running on a Windows server with the Chrome or Firefox browser engine, preferably on a server located in the United States.

### Run the Evaluation

You can run the evaluation with the following command:

```bash
python evaluate.py \
    --global_reward_mode dom_reward \
    --index -1 \
    --single_task_name "Find Dota 2 game and add all DLC to cart in steam." \
    --planning_text_model gpt-4o-mini \
    --global_reward_text_model gpt-4o-mini
```

This command runs the script with DOM-based self-reward, passing the default task name "Find Dota 2 game and add all DLC to cart in steam." and the default data index `-1`. It uses `gpt-4o-mini` for both planning and global reward reasoning. The evaluation mode is controlled by the `task_mode` parameter in `configs/setting.toml`, which lets you choose between batch mode and single mode (without automatic evaluation). Remember to set the path to your test file in `configs/setting.toml`.

To batch test the Mind2Web-Live dataset, change `task_mode` to `batch_tasks` in `configs/setting.toml`. The default is `single_task` mode, which tests only one sample at a time.

```toml
# configs/setting.toml:
[basic]
task_mode = "batch_tasks"    # single_task or batch_tasks
```



### Parameter Descriptions

This program supports several command-line arguments to customize its behavior:

- `--global_reward_mode`: Selects the method for obtaining global rewards.
  - Options: `dom_vision_reward`, `dom_reward`, `vision_reward`, `no_global_reward`
  - Default: `dom_reward`
  - Description: Defines how rewards are obtained, depending on the interaction mode:
    - `dom_vision_reward`: Rewards are calculated using both DOM and vision data. Currently, only GPT-4V is supported as the vision model.
    - `dom_reward`: Rewards are based solely on DOM interactions. You can specify the language model used for reward reasoning with the `--global_reward_text_model` parameter.
    - `vision_reward`: Rewards are derived from vision-based interactions only. Currently, only GPT-4V is supported as the vision model.
    - `no_global_reward`: No global rewards are calculated.

- `--index`: Selects which data index (or range of indices) to process.
  - Type: String
  - Default: `-1`
  - Description: Use this parameter to specify a range or specific index for data processing. For example, `0,5` will process data from index 0 to 5.

- `--single_task_name`: Defines the task name of the single task to execute.
  - Type: String
  - Default: `"Find Dota 2 game and add all DLC to cart in steam."`

- `--planning_text_model`: Specifies the model used for the planning module.
  - Type: String
  - Default: `gpt-4o-mini`

- `--global_reward_text_model`: Specifies the model used for global reward reasoning.
  - Type: String
  - Default: `gpt-4o-mini`
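
The flags and defaults listed above can be sketched with `argparse` as follows. This is an illustrative reconstruction from the parameter descriptions, not the actual contents of `evaluate.py`. The helper `split_index` is hypothetical: it only splits a value such as `0,5` into two integers and makes no claim about whether the end index is inclusive or how `-1` is interpreted by the real script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the evaluation CLI")
    parser.add_argument(
        "--global_reward_mode",
        choices=["dom_vision_reward", "dom_reward", "vision_reward", "no_global_reward"],
        default="dom_reward",
    )
    parser.add_argument("--index", type=str, default="-1")
    parser.add_argument(
        "--single_task_name",
        type=str,
        default="Find Dota 2 game and add all DLC to cart in steam.",
    )
    parser.add_argument("--planning_text_model", type=str, default="gpt-4o-mini")
    parser.add_argument("--global_reward_text_model", type=str, default="gpt-4o-mini")
    return parser


def split_index(index: str):
    """Hypothetical helper: turn "0,5" into (0, 5); return None for a single value."""
    parts = [part.strip() for part in index.split(",")]
    if len(parts) == 2:
        return int(parts[0]), int(parts[1])
    return None


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--index", "0,5"])
print(args.global_reward_mode, split_index(args.index))
```

Declaring `choices` for `--global_reward_mode` makes the parser reject unsupported reward modes before any task starts.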

#### Interaction Mode

Evaluating web agents in an online environment can be painful because of network problems or bot checks on certain websites. An evaluation method that accommodates these issues gives an accurate assessment of an agent's performance under the conditions at the time of the run. We also provide a more flexible interaction mode, which lets users manually resolve environmental issues so that their web agents can reach their best performance. To enable it, set the `interaction_mode` parameter in `configs/setting.toml`. In future versions, we plan to extend our error handling for online agent inference and to reduce human effort by triggering manual intervention only when exceptions occur.