<h1 align="center">VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking</h1>


> [!NOTE]
> This work is still in progress and additional data will be included in a future version.



## 📖 Overview

Large Language Model (LLM) agents have recently shown strong capabilities on web-based information-seeking tasks. However, existing efforts mainly focus on **single-fact retrieval** and rely on **outcome-only verification**, which limits their scalability to realistic, knowledge-intensive scenarios: long-horizon web tasks that require large-scale retrieval and synthesis of information from diverse sources. In this work, we introduce VeriWeb, a verifiable long-chain web benchmark designed to support the evaluation and development of web agents in realistic web environments. Our benchmark emphasizes two critical dimensions:

- (1) **🔗 Long-chain complexity**, encompassing both breadth- and depth-oriented search tasks to assess how effectively web agents ensure comprehensive information coverage and consistent context tracking in multi-hop reasoning;

- (2) **✅ Subtask-level verifiability**, where tasks are decomposed into a sequence of interdependent verifiable subtasks. This structure enables diverse exploration strategies within each subtask, while ensuring that each subtask-level answer remains unchanged and verifiable.

The benchmark consists of **302** tasks across five real-world domains, each with a complete trajectory demonstration **annotated by human experts**. Extensive experiments on VeriWeb using various agents powered by different foundation models reveal significant performance gaps in handling long-horizon web tasks, highlighting the need for more powerful agentic information-seeking capabilities.


## 🚀 Installation

```bash
# Evaluation only
pip install openai tqdm

# Running agents (quote the extras so shells such as zsh do not expand the brackets)
pip install openai tqdm "camel-ai[all]" browser-use
```

## 🤖 Running Agents

Example agents are provided in the `agents` directory. To run one, execute its script, replacing `some_agent.py` with the agent you want:

```shell
python agents/some_agent.py
```

## 📊 Evaluation

The VeriWeb dataset is located at [data.json](data/data.json). Each entry has the following format:

```json
[
  {
    "id": "1",              // index id
    "name": "V1_3",         // name of the task
    "type": "global",       // type of the task, global or causal
    "instruction": "xxxxx", // instruction for the task
    "answer": "xxxxx"       // expected answer for the task, in JSON format
  },
  ......
]
```
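As a minimal sketch of how the dataset might be loaded and inspected in Python: the helper names `load_tasks` and `count_by_type` are hypothetical and not part of the repository, and the code assumes the file on disk is plain JSON with the fields shown above (without the explanatory comments).

```python
import json
from collections import Counter


def load_tasks(path="data/data.json"):
    """Read the dataset file and return its entries as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_by_type(tasks):
    """Count tasks per value of the "type" field (global or causal)."""
    return Counter(task["type"] for task in tasks)
```

Calling `count_by_type(load_tasks())` from the repository root should give the number of tasks of each type, which is a quick way to check that the file was read as expected.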

The evaluation script `evaluate.py` scores agent outputs using an LLM as a judge. It expects a JSON file in the following format:

```json
[
  {
    "id": "1",              // index id
    "name": "V1_3",         // name of the task
    "type": "global",       // type of the task, global or causal
    "instruction": "xxxxx", // instruction for the task
    "answer": "xxxxx",      // expected answer for the task, in JSON format
    "prediction": "xxxxx",  // agent's predicted result
    "nsteps": 10            // number of steps taken by the agent
  },
  ......
]
```
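One way to produce such a file is sketched below. This is an illustration under stated assumptions, not the repository's own tooling: `build_predictions`, `find_incomplete`, and `save_predictions` are hypothetical helpers, and `run_agent` stands in for whatever agent you use, assumed here to take the instruction string and return a `(prediction, nsteps)` tuple.

```python
import json

# Fields shown in the format above; whether evaluate.py needs all of them
# is an assumption.
REQUIRED_FIELDS = (
    "id", "name", "type", "instruction", "answer", "prediction", "nsteps",
)


def build_predictions(tasks, run_agent):
    """Copy each dataset entry and attach the agent's output to the copy.

    `run_agent` is a hypothetical callable: instruction -> (prediction, nsteps).
    """
    records = []
    for task in tasks:
        prediction, nsteps = run_agent(task["instruction"])
        records.append({**task, "prediction": prediction, "nsteps": nsteps})
    return records


def find_incomplete(records):
    """Return the ids of records that lack any of the expected fields."""
    return [
        record.get("id")
        for record in records
        if any(field not in record for field in REQUIRED_FIELDS)
    ]


def save_predictions(records, path="prediction.json"):
    """Write the records as a JSON array, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

Checking that `find_incomplete(records)` is empty before saving catches missing fields early, before any judge-model calls are made.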

With this file, you can run the evaluation script to get the performance of the agent:

```shell
python evaluate.py --input_file prediction.json --output_file output.json
```

Then, you can use `calc_avg.py` to calculate the average score of the evaluation results:

```shell
python calc_avg.py --input_file output.json
```

## 🗂️ Project Structure

The directory structure of the project is defined as follows:

```
agent-workflow-devkit/
├── agents/                 # Agent implementations
│   ├── browseruse.py       # Browser-use agent example
│   ├── owl.py              # Multi-agent system example
│   └── deepresearch.py     # Deep research agent example
├── data/                   # Dataset files
│   └── data.json           # Main dataset
├── evaluated/              # Evaluation results
├── predictions/            # Model predictions
├── evaluate.py             # Evaluation script
├── batch_evaluate.py       # Batch evaluation
├── calc_avg.py             # Calculate averages
└── utils.py                # Utility functions
```


## 📄 License

This project is licensed under the Apache 2.0 License.


