# WebArena Evaluation System

An evaluation framework for validating agent performance on WebArena tasks across web platforms such as GitLab, Reddit, Shopping, Shopping Admin, and Map.

## Overview

This evaluation system validates agent responses against expected outcomes for three types of tasks:
- **Retrieve**: Verify returned information matches expected data
- **Navigate**: Check browser state and URL navigation
- **Mutate**: Validate system state modifications via API or direct inspection

## Entry Point

The main evaluation interface is the `evaluate_task` method in the `WebArenaEvaluator` class:

```python
from evaluation.evaluator import WebArenaEvaluator
from evaluation.types import WebArenaTask, WebArenaTaskResponse
from evaluation.models import AllocationResource

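evaluator = WebArenaEvaluator()

# `await` requires an async context; `task`, `task_result`, and `resources`
# are built as shown in the Integration Example section.
results = await evaluator.evaluate_task(
    task=task,
    task_result=task_result,
    resources=resources,
)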
```

## Parameters

### 1. `task: WebArenaTask`
The task specification: what the agent was asked to accomplish, the evaluation criteria, and the context needed to determine success. It is the ground truth for what should have been achieved and how the result is validated.

Task definitions are loaded from `assets/webarena_verified.json`.

### 2. `task_result: WebArenaTaskResponse`
The agent's execution output: its structured response, the final browser state, and any errors encountered. The evaluator compares this output against the expected outcomes defined in `task`.

### 3. `resources: list[AllocationResource]`
The live test environment in which the task ran. Each resource carries the connection details and credentials the evaluator needs to inspect the running systems and confirm that the agent's actions had the intended effect on the actual platforms.

## Task Types & Evaluation Methods

### Retrieve Tasks
Validate that returned information matches expected data:
- Compare agent's `results` field against `expected_retrieve_value`
- Support exact match, normalized text comparison, and structured data validation
- Example: "What is the top-selling product in 2022?"
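As a rough illustration of what a normalized text comparison could look like, here is a minimal sketch. The helper names `normalize_text` and `results_match` are hypothetical and the normalization rules (Unicode NFKC, case folding, whitespace collapsing, order-insensitive matching) are assumptions; the evaluator's real comparison logic may differ.

```python
import unicodedata
from typing import Sequence


def normalize_text(value: str) -> str:
    """Hypothetical normalizer: NFKC, case-fold, then collapse whitespace."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return " ".join(folded.split())


def results_match(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """Compare two value lists after normalization, ignoring order (an assumption)."""
    return sorted(map(normalize_text, actual)) == sorted(map(normalize_text, expected))
```

With these assumptions, `results_match(["  quest  lumaflex™ band "], ["Quest Lumaflex™ Band"])` evaluates to `True`, while an exact string match would fail on the casing and spacing.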

### Navigate Tasks
Verify browser reached the correct state:
- **URL Validation**: Check `last_urls` against `expected_ui_state.url`
- **DOM Inspection**: Use CDP to validate page elements and content
- Example: "Navigate to the user profile page"
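The URL check can be pictured with the following sketch, which compares a final URL with an expected one while tolerating differences in host casing, trailing slashes, and query parameter order. The function names and the tolerance rules are assumptions made for illustration, not the evaluator's documented behavior.

```python
from urllib.parse import parse_qs, urlsplit


def url_key(url: str):
    """Reduce a URL to a comparable key (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # parse_qs returns a dict, so parameter order does not affect equality.
    query = {name: sorted(values) for name, values in parse_qs(parts.query).items()}
    return (parts.scheme.lower(), parts.netloc.lower(), path, query)


def urls_match(actual: str, expected: str) -> bool:
    """True when both URLs reduce to the same key."""
    return url_key(actual) == url_key(expected)
```

In practice the last entry of `last_urls` would be passed as `actual` and `expected_ui_state.url` as `expected`.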

### Mutate Tasks
Confirm system state was modified correctly:
- **API Validation**: Connect to system APIs to verify data changes
- **Database Inspection**: Direct database queries to validate state
- Example: "Create a new GitLab issue with title 'Bug Report'"
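For the GitLab example, an API-based check might query the issues endpoint of the GitLab REST API (v4) and look for the expected title. The sketch below only builds the request and inspects an already-decoded response; the helper names, the project path `byteblaze/demo`, and the port are hypothetical, and the evaluator's own API client may work differently.

```python
from typing import Any, Dict, Iterable
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_issue_search_request(
    base_url: str, project_path: str, title: str, token: str
) -> Request:
    """Build a GET request that searches a project's issues by title."""
    # GitLab accepts a URL-encoded "namespace/project" path as the project ID.
    project_id = quote(project_path, safe="")
    query = urlencode({"search": title, "in": "title"})
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project_id}/issues?{query}"
    return Request(url, headers={"PRIVATE-TOKEN": token})


def issue_exists(issues: Iterable[Dict[str, Any]], title: str) -> bool:
    """True if any decoded issue has exactly the expected title."""
    return any(issue.get("title") == title for issue in issues)
```

The request can then be sent with `urllib.request.urlopen` and the body decoded with `json.load` before calling `issue_exists`. The exact-title check matters because the `search` parameter also returns partial matches.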

## Integration Example

```python
import asyncio
import json

from evaluation.evaluator import WebArenaEvaluator
from evaluation.types import (
    ActionType,
    StatusType,
    WebArenaTask,
    WebArenaTaskResponse,
    WebArenaTaskStatus,
    WebArenaVerifiedAgentResponse,
)
from evaluation.models import AllocationResource

async def evaluate_agent_response():
    # Load the first task definition (or create one programmatically).
    # Assumes the file is a JSON array and WebArenaTask is a pydantic model.
    with open("assets/webarena_verified.json", encoding="utf-8") as f:
        task = WebArenaTask.parse_obj(json.load(f)[0])
    
    # Agent response
    agent_response = WebArenaVerifiedAgentResponse(
        action=ActionType.RETRIEVE,
        status=StatusType.SUCCESS,
        results=["Quest Lumaflex™ Band"]
    )
    
    task_result = WebArenaTaskResponse(
        response=agent_response,
        last_urls=["https://shop.example.com/admin/reports"],
        status=WebArenaTaskStatus.SUCCESS
    )
    
    # Test environment resources
    resources = [
        AllocationResource(
            allocation_id="test-alloc-123",
            site_id="test-site-456",
            container_name="shopping-admin-container",
            website_type="shopping_admin",
            base_url="https://shop.example.com",
            cdp_url="ws://localhost:9222",
            vnc_url="http://localhost:5900",
            readonly=False,
            username="admin",
            password="password123",
            role="admin"
        )
    ]
    
    # Evaluate
    evaluator = WebArenaEvaluator()
    results = await evaluator.evaluate_task(
        task=task,
        task_result=task_result,
        resources=resources
    )
    
    for result in results:
        print(f"Score: {result.score}")
        print(f"Success: {result.is_success}")
        print(f"Messages: {result.assertion_msgs}")

# Run evaluation
asyncio.run(evaluate_agent_response())
```

## Evaluation Results

The system returns `WebarenaTaskEvalResult` objects containing:
- `score`: 1.0 for success, 0.0 for failure
- `assertion_msgs`: Detailed validation messages
- `validation_data`: Raw comparison data for debugging
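Since `evaluate_task` returns several result objects, callers will often want a single summary. The sketch below uses a hypothetical stand-in class, `EvalResultStub`, that exposes only the three fields listed above, and it assumes a task passes only when every result scores 1.0; neither the stub nor that rule comes from the evaluator itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EvalResultStub:
    """Hypothetical stand-in exposing the three documented fields."""
    score: float
    assertion_msgs: List[str] = field(default_factory=list)
    validation_data: Dict[str, Any] = field(default_factory=dict)


def summarize(results: List[EvalResultStub]) -> Dict[str, Any]:
    """Count passing results and collect messages from the failing ones."""
    failed = [r for r in results if r.score < 1.0]
    return {
        "total": len(results),
        "passed": len(results) - len(failed),
        # Assumption: an empty result list does not count as a pass.
        "all_passed": bool(results) and not failed,
        "failure_msgs": [msg for r in failed for msg in r.assertion_msgs],
    }
```

The same function should work on the real result objects as long as they expose `score` and `assertion_msgs` as described.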

## Supported Platforms

- **GitLab**: Issue management, project operations
- **Reddit**: Post creation, voting, commenting
- **Shopping**: Product catalog, order management
- **Shopping Admin**: Administrative operations
- **Map**: Location-based tasks

## Requirements

- Python 3.8+
- Chrome/Chromium with DevTools Protocol support
- Network access to target systems
- Authentication credentials for protected platforms

## Error Handling

The evaluator reports the following error conditions:
- Connection failures to target systems
- Authentication issues
- Malformed agent responses
- Evaluation function errors
- Resource allocation problems

All errors are captured in the evaluation results with descriptive messages for debugging.

## API Access Configuration

### GitLab Authentication
GitLab evaluations require a personal access token for API access. The default token is configured in `gitlab/apis.py` as `"my-super-secret-1234"`. If your GitLab instance uses a different token, do one of the following:

1. Update the `private_token` field in the `GitLabSettings` class in `gitlab/apis.py`.
2. Set the environment variable `WEBARENA_VERIFIED_GITLAB_PRIVATE_TOKEN` to your token.

The token must be a personal access token of the `byteblaze` user with appropriate permissions to read/write issues, projects, and other GitLab resources that the evaluation tasks require.
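The precedence described above (environment variable first, built-in default otherwise) can be sketched as follows. This is an illustration only: `resolve_gitlab_token` is a hypothetical helper, and the actual `GitLabSettings` class may resolve the value differently.

```python
import os

ENV_VAR = "WEBARENA_VERIFIED_GITLAB_PRIVATE_TOKEN"
DEFAULT_TOKEN = "my-super-secret-1234"  # default documented for gitlab/apis.py


def resolve_gitlab_token() -> str:
    """Return the environment override if set and non-empty, else the default."""
    return os.environ.get(ENV_VAR) or DEFAULT_TOKEN
```

Setting the variable in the shell before launching the evaluation, or in `os.environ` before the evaluator is created, avoids editing `gitlab/apis.py`.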
