# WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks


| ![teaser.png](figs/overview.png) |
|:---|
| <p align="justify"><b>Figure 1. The WebChoreArena challenge.</b> WebChoreArena introduces more complex and labor-intensive chore tasks in web domains, pushing the boundaries of web agent capabilities. This enhanced benchmark allows for a clearer evaluation of progress in advanced models and reveals that even powerful models such as GPT-5 still have significant room for improvement.</p> |


## 📕 Abstract
As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: *Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves?* In this paper, we introduce **WebChoreArena**, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) **Massive Memory** tasks requiring accurate retrieval of large amounts of information in the observations, (ii) **Calculation** tasks demanding precise mathematical reasoning, and (iii) **Long-Term Memory** tasks necessitating long-term memory across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress.



![task_type.png](figs/task_type.png)

## 📦 Requirements

### API KEY
We used the Azure OpenAI API, the Anthropic Claude API, and the Google Gemini API for the experiments. We will share the code for the OpenAI API soon.

※ For GPT-5, please set AZURE_OPENAI_API_EAST_KEY and AZURE_OPENAI_EAST_ENDPOINT.

```bash
export AZURE_OPENAI_API_KEY='your-api-key-here'
export AZURE_OPENAI_ENDPOINT="your-azure-endpoint-here"
export ANTHROPIC_API_KEY='your-api-key-here'
export GEMINI_API_KEY='your-api-key-here'
export AZURE_OPENAI_API_EAST_KEY='your-api-key-here'
export AZURE_OPENAI_EAST_ENDPOINT="your-azure-endpoint-here"
```
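Before launching a long run, it can help to confirm that the keys are actually visible to the Python process. The snippet below is a minimal sketch, not part of the repository: the variable names come from the export block above, while the grouping into `COMMON_VARS` and `GPT5_VARS` and the helper `missing_vars` are illustrative assumptions. Which keys you need presumably depends on the models you run.

```python
import os

# Variable names taken from the export block above.
# The split into two groups is an assumption made for this sketch.
COMMON_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
]
# Per the note above, these two are the ones to set for GPT-5.
GPT5_VARS = ["AZURE_OPENAI_API_EAST_KEY", "AZURE_OPENAI_EAST_ENDPOINT"]


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for label, names in (("common", COMMON_VARS), ("GPT-5", GPT5_VARS)):
        missing = missing_vars(names)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.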

### End-to-end Evaluation
1. Set up the standalone environment following the original [WebArena repository](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md).
2. After evaluating each website, reset the environment to the initial state following the instructions [here](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md#environment-reset). After the reset, run the inference for cross-site tasks.


## 📂 Code Structure
```bash
WebChoreArena/
│── figs/          # Figures used across the project
│── AgentOccam/    # Web browsing agent module
│── BrowserGym/    # Web browsing agent module
│── README.md      # Main documentation for the overall project
```
Please see the [AgentOccam](./AgentOccam/) and [BrowserGym](./BrowserGym/) directories for more details.

## 📊 Dataset
We provide the dataset JSON files in [AgentOccam/config_files](./AgentOccam/config_files) and [BrowserGym/config_files](./BrowserGym/config_files). The benchmark can be run directly without any additional downloads.


### Columns Info.

The columns in this JSON file are defined as follows:

| Column Name         | Description |
|---------------------|-------------|
| `task_id`           | Unique identifier for the task |
| `sites`             | Websites used in the task |
| `start_url`         | Initial URL where the agent begins |
| `start_url_lite`    | Simplified start URL for easier tasks |
| `storage_state`     | Path to login/session state |
| `affect_environment`| Whether the task affects the environment |
| `required_wait`     | Whether a wait is needed after the task |
| `intent_template`   | Template defining task goal |
| `intent`            | Specific task goal or instruction |
| `required_obs`      | Required modalities (any/text/image) |
| `type_main`         | Main task category |
| `type_sub`          | Subcategory of task |
| `description`       | How the task should be performed |
| `instantiation_dict`| Dictionary with content for templates |
| `eval`              | Evaluation method used |
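
As a rough illustration of how these columns can be read, the sketch below parses task records and summarizes them. It assumes each config file holds a JSON list of task objects, as in WebArena. The helper names, the sample records, and their `sites` and `type_main` values are made up for the example, and treating a task with more than one entry in `sites` as cross-site is also an assumption.

```python
import json
from collections import Counter


def parse_tasks(text):
    """Parse task records, assuming the file is a JSON list of objects."""
    tasks = json.loads(text)
    if not isinstance(tasks, list):
        raise ValueError("expected a JSON list of task objects")
    return tasks


def count_by_type(tasks):
    """Count tasks per `type_main` value."""
    return Counter(task.get("type_main", "unknown") for task in tasks)


def cross_site_ids(tasks):
    """Return the `task_id` of every task that lists more than one site."""
    return [task["task_id"] for task in tasks if len(task.get("sites", [])) > 1]


# Made-up records for illustration; real files carry all the columns above.
example_text = json.dumps([
    {"task_id": 0, "sites": ["shopping"], "type_main": "Calculation"},
    {"task_id": 1, "sites": ["reddit"], "type_main": "Massive Memory"},
    {"task_id": 2, "sites": ["shopping", "reddit"], "type_main": "Calculation"},
])

tasks = parse_tasks(example_text)
print(count_by_type(tasks))
print(cross_site_ids(tasks))
```

To try this on a real file, read its contents with `pathlib.Path(...).read_text()` and pass the text to `parse_tasks`.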

### Small Set
Running the full WebChoreArena benchmark can cost several hundred dollars in API usage. Therefore, we also provide a small subset of tasks. For each directory, the file `small_set_ids.txt` inside `config_files` specifies the task IDs used in the small subset. This subset corresponds to the one used in the subset experiments reported in Table 2 of the paper.
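
Selecting the subset from a full task list could look like the sketch below. The exact layout of `small_set_ids.txt` is not described here, so the parser simply collects every integer in the text, which covers one ID per line as well as comma- or space-separated lists. It also assumes `task_id` values are integers. `parse_ids` and `select_small_set` are hypothetical helpers, not functions from the repository.

```python
import re


def parse_ids(text):
    """Collect every integer found in the text, whatever the separator."""
    return {int(token) for token in re.findall(r"\d+", text)}


def select_small_set(tasks, ids):
    """Keep only the tasks whose `task_id` is in the given ID set."""
    return [task for task in tasks if int(task["task_id"]) in ids]


# Made-up stand-ins for the contents of small_set_ids.txt and a task list.
ids = parse_ids("0\n2, 5\n")
tasks = [{"task_id": n} for n in range(4)]
subset = select_small_set(tasks, ids)
print([task["task_id"] for task in subset])
```

IDs that appear in the text but not in the task list are silently ignored, so comparing `len(subset)` with `len(ids)` is a quick way to spot a mismatch.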


## 🤝 Acknowledgement
This repository builds on the following codebases. We sincerely appreciate their great work.
* [WebArena](https://github.com/web-arena-x/webarena/tree/main)
* [VisualWebArena](https://github.com/web-arena-x/visualwebarena/tree/main)
* [AgentOccam](https://github.com/amazon-science/AgentOccam)
* [BrowserGym](https://github.com/ServiceNow/BrowserGym)
* [AWM](https://github.com/zorazrw/agent-workflow-memory)
