# Bird-Interact Agentic Evaluation 

This part is for running and evaluating BIRD-Interact Agent.

## Agentic Evaluation Setting
Here we introduce the agentic evaluation mode of BIRD-Interact.

### Task Setting
A task consists of two phases: an **ambiguous user query** (phase-1) followed by a **follow-up query** (phase-2). All these queries require the understanding of external knowledge. To complete these queries, agents could interact with two environments: a **Database Environment** and a **User Simulator Environment**. Each interaction is constrained by a **budget** measured in virtual *bird-coin*s, with different actions incurring varying *bird-coin*s. Only after phase-1 is completed, user will deliver the follow-up query. The rewards for phase-1 and phase-2 are 0.7 and 0.3 respectively. 

> **Why set budget?**: Agent has the risk of unlimited exploration behaviors (e.g. ReAct Agent may fall into infinite loop). This setting also helps explore Interaction-time scaling (budget ↑) and stress testing (budget ↓) behavious of Agent. Additinoally, it could test the agent's planning and decision making ability.


### Environments and Action Space
Different from most studies only containing a working environment, our environment includes a **Database Env** and a **User Simulator Env**.
- [Database Env](src/envs/bird_interact_env/sql_env.py): A PostgreSQL database with a set of tables.
- [User Simulator Env](src/envs/user_simulator/us_env_bird_interact.py): A user simulator that can clarify ambiguities and test the submitted SQL from the Agent.


Heuristically, we set higher cost for those actions interacting with users, i.e. `ask` and `submit`.

### Budget Calculation
We set the budget case by case. 
- **Starting Budget** (adjustable parameter): First, we assign a "starting budget", (default **6** *bird-coin*s) for each task. 
- **Ambiguity Resolution Budget**: Then, for the task with many ambiguities, it may require more budget to finish the task. Thus, we assign **2** *bird-coin*s for each ambiguity of the user query. 
- **User Patience Budget** (adjustable parameter): Also, we introduce a adjustable *bird-coin*s to indicate the user patience, including four basic levels, **0**, **2*3**, **2*5**, **2*7** *bird-coin*s.

The total budget = starting budget + ambiguity resolution budget + user patience budget. You could find the budget calculation in batch-running implemented in [batch_run_bird_interact/main.py](batch_run_bird_interact/main.py#L110).

Current basic experiments are conducted with **Starting Budget** = **6** *bird-coin*s, **User Patience Budget** = **6** *bird-coin*s. Feel free to adjust these parameters to test different settings.

## Repository Structure

```
.
├── batch_run_bird_interact/    # Code for batch-running experiments (primary results)
├── src/                        # Core source code for single-run experiments
├── experiments/                # Main code for single-run experiments
├── data/                       # Dataset storage
├── outputs/                    # Experiment results: single-run and batch-run results
├── postgre_table_dumps/        # PostgreSQL database dumps
├── Dockerfile.postgresql       # PostgreSQL container configuration
├── Dockerfile.so_eval         # Evaluation environment configuration
├── docker-compose.yml         # Container orchestration
├── requirements.txt           # Python dependencies
├── run_batch_experiments.sh   # Batch experiment runner
└── run_experiment.sh          # Single experiment runner
```

## Quick Start

### 1. Data Preparation

```bash
mkdir data && cd data
# Download bird-interact-lite dataset
# Combine with GT fields (contact us for access) into bird_interact_data.jsonl
```

### 2. Environment Setup

1. Download the database dumps:
   - Get from available sources
   - Move to current working directory `bird_interact_agent` and rename to `postgre_table_dumps`

2. Build and run Docker containers:
   ```bash
   cd bird_interact_agent
   docker compose up --build
   ```
   This launches two containers:
   - PostgreSQL database
   - Evaluation environment (so_eval_env)

#### OpenAI/Third-party API Setup
Configure in `src/llm_utils/config.py`:
- Set `base_url`
- Set `api_key`

## Running Experiments

### Start the Docker containers

```bash
cd bird_interact_agent
docker compose exec so_eval_env bash
```

### Single Sample Mode
Single sample mode (`src/` and `experiments/`) is useful for debugging and workflow understanding
```bash
bash run_experiment.sh
```
Output directory: `outputs/single_runs/`

### Batch Mode
Batch mode (`batch_run_bird_interact/`) is recommended for production runs
```bash
bash run_batch_experiments.sh
```
Output directory: `outputs/batch_runs/`
Default user patience is set to 6.
