# SeekBench Toolkit

A toolkit for evaluating search-based AI agents. It covers running evaluations, annotating agent traces, and analyzing agents' epistemic competence.

## Overview

This toolkit provides a complete pipeline for:
1. **Evaluation**: Running search-based AI agents on various datasets
2. **Annotation**: Automatically annotating agent traces with reasoning quality metrics
3. **Analysis**: Analyzing agent groundedness, recovery dynamics, and answer calibration

## Directory Structure

```
search_evaluation_toolkit/
├── evaluation/           # Core evaluation scripts
│   ├── simple_eval.py   # Main evaluation script
│   └── rerun_evaluation.py  # Re-run evaluations with different models
├── annotation/          # Trace annotation system
│   ├── main.py         # Main annotation script
│   ├── prompts.py      # LLM prompts for annotation
│   ├── ontology.json   # Annotation ontology definitions
│   └── ...            # Other annotation utilities
├── analysis/           # Analysis and visualization scripts
│   ├── O1_RL_training_impact_on_reasoning_search.py  # RQI analysis
│   ├── O2_Recovery_dynamics.py                       # Recovery dynamics analysis
│   └── O3_Evidence_strength_per_turn.py             # Calibration analysis
├── scripts/            # Shell scripts for automation
│   ├── eval_simple.sh  # Batch evaluation script
│   └── rerun_exp.sh    # Re-run experiments script
├── data/              # Data directory
│   ├── seekbench_data.jsonl  # SeekBench annotated dataset
│   └── cleaning_summary.md   # Dataset cleaning documentation
├── run_analysis_example.py  # Example script for SeekBench analysis
└── requirements.txt   # Python dependencies
```

## Quick Start

### 1. Installation

```bash
# Clone or copy the toolkit
cd search_evaluation_toolkit

# Install dependencies
pip install -r requirements.txt

# Set up environment variables (create .env file)
echo "OPENAI_API_KEY=your_api_key_here" > .env
echo "OPENAI_BASE=https://your-openai-endpoint" >> .env  # Optional
```
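Before running annotation, it can help to confirm that the `.env` file contains what you expect. The sketch below uses only the standard library and is not how the toolkit itself reads configuration (that is not documented here); `load_env_file` and `check_env` are hypothetical helper names.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; skips blank lines and '#' comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_env(path=".env"):
    """Return the required variable names that are missing or empty."""
    # Values already exported in the shell take precedence over the file.
    merged = {**load_env_file(path), **os.environ}
    required = ["OPENAI_API_KEY"]  # OPENAI_BASE is optional per the steps above
    return [name for name in required if not merged.get(name)]


if __name__ == "__main__":
    missing = check_env()
    print("missing variables:", missing or "none")
```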

### 2. Quick Analysis with Included Dataset

**Option A: Use the included SeekBench dataset (recommended for a quick start)**
```bash
# Run analysis directly on the included annotated dataset
python analysis/O1_RL_training_impact_on_reasoning_search.py \
    --input data/seekbench_data.jsonl \
    --outdir analysis_results
```

### 3. Running Evaluations (Optional)

**Option B: Run your own evaluations**
```bash
# Run evaluation on a single model and dataset
python evaluation/simple_eval.py \
    --model_path "Qwen/Qwen2.5-7B-Instruct" \
    --data_path "data/hotpotqa/process_test.parquet" \
    --output_path "eval_results/hotpotqa/qwen_results.jsonl" \
    --do_search \
    --val_batch_size 32
```

#### Batch Evaluation
```bash
# Make scripts executable
chmod +x scripts/*.sh

# Run batch evaluation (modify models and datasets in the script)
./scripts/eval_simple.sh
```

### 4. Annotating Traces

```bash
# Annotate evaluation results
python annotation/main.py \
    --input_file "eval_results/hotpotqa/qwen_results.jsonl" \
    --output_file "annotated_results/hotpotqa/qwen_annotated.jsonl" \
    --ontology_file "annotation/ontology.json" \
    --model "gpt-4o-mini" \
    --concurrency 10
```

### 5. Running Analysis

#### RQI Analysis (O1)
```bash
python analysis/O1_RL_training_impact_on_reasoning_search.py \
    --input final_annotated_traces.jsonl \
    --outdir analysis_results/figs_cohort \
    --require_clear_for_sufficient \
    --exclude_pre_sufficient_answers
```

#### Recovery Dynamics Analysis (O2)
```bash
python analysis/O2_Recovery_dynamics.py \
    final_annotated_traces.jsonl \
    analysis_results/recovery_figs
```

#### Evidence Strength Analysis (O3)
```bash
python analysis/O3_Evidence_strength_per_turn.py \
    --input final_annotated_traces.jsonl \
    --outdir analysis_results/evidence_figs
```

## Included Dataset

### SeekBench Annotated Dataset

The toolkit includes a pre-annotated SeekBench dataset (`data/seekbench_data.jsonl`) with 190 annotated traces. Each record has the following structure:

```json
{
  "question": "Which is the body of water by Leo Bennett's place of death?",
  "ground_truth": ["River Thames"],
  "reward": 1,
  "is_correct": true,
  "trace": [
    {
      "type": "reasoning",
      "content": "First, I need to find out where Leo Bennett died.",
      "step_index": 0,
      "trace_dependency": {"dependent_on": null},
      "annotation": {
        "type": "PlanFormation",
        "attributes": {
          "premise_grounding": "Not Grounded"
        }
      }
    },
    {
      "type": "search",
      "query": "Leo Bennett place of death",
      "step_index": 1,
      "trace_dependency": {"dependent_on": 0},
      "annotation": {
        "type": "InitialQuery",
        "attributes": {
          "premise_grounding": "Directly Grounded"
        }
      }
    }
  ],
  "final_answer": "River Thames",
  "queries": ["Leo Bennett place of death", "Leo Bennett place of death Thames Ditton"],
  "model": "fewshot-Qwen2.5-7B-Instruct",
  "dataset": "musique",
  "id": 1
}
```

**Key Annotation Fields:**
- `type`: Annotation type (e.g. StateAssessment, PlanFormation, InformationSynthesis, InitialQuery)
- `premise_grounding`: Whether reasoning is grounded in evidence ("Directly Grounded" or "Not Grounded")
- `information_quality`: Search result quality ("Sufficient" or "Insufficient")
- `information_clarity`: Search result clarity ("Clear", "Unclear", or "Contradictory")
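
To get a feel for these fields, the records can be scanned with a few lines of Python. This is a minimal sketch that assumes every record follows the structure shown above, with the labels nested under `annotation.attributes`; `iter_records` and `tally_annotations` are illustrative names, not part of the toolkit.

```python
import json
from collections import Counter


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def tally_annotations(records):
    """Count annotation types and premise_grounding labels across all steps."""
    types, grounding = Counter(), Counter()
    for record in records:
        for step in record.get("trace", []):
            annotation = step.get("annotation") or {}
            if "type" in annotation:
                types[annotation["type"]] += 1
            label = (annotation.get("attributes") or {}).get("premise_grounding")
            if label is not None:
                grounding[label] += 1
    return types, grounding


if __name__ == "__main__":
    # Inline sample mirroring the record above. To scan the real file, pass
    # open("data/seekbench_data.jsonl", encoding="utf-8") to iter_records.
    sample = [json.dumps({
        "trace": [
            {"type": "reasoning",
             "annotation": {"type": "PlanFormation",
                            "attributes": {"premise_grounding": "Not Grounded"}}},
            {"type": "search",
             "annotation": {"type": "InitialQuery",
                            "attributes": {"premise_grounding": "Directly Grounded"}}},
        ]
    })]
    types, grounding = tally_annotations(iter_records(sample))
    print(dict(types))
    print(dict(grounding))
```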

**Usage:**
```bash
# Option 1: Run all analyses with one command
python run_analysis_example.py

# Option 2: Run individual analyses
python analysis/O1_RL_training_impact_on_reasoning_search.py \
    --input data/seekbench_data.jsonl \
    --outdir analysis_results
```

## Key Features

### Evaluation System
- **Multi-model support**: Qwen, SearchR1, DeepResearcher, ReSearch, ASearcher
- **Multi-dataset support**: HotpotQA, Natural Questions, 2WikiMultiHopQA, MuSiQue, TriviaQA, PopQA, Bamboogle
- **Few-shot and zero-shot evaluation**
- **Batch processing capabilities**

### Annotation System
- **Automated trace annotation** using LLM-based analysis
- **Comprehensive ontology** covering reasoning types, search quality, and answer correctness
- **Premise grounding analysis** to evaluate evidence-based reasoning
- **Concurrent processing** for efficiency

### Analysis Capabilities

#### O1: RQI (Reasoning Quality Index)
- **Model-level RQI** and **type-level RQI**
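
This README does not define RQI, so the following is only an illustration of the model-level versus type-level split. It assumes, as a labeled guess, that the index is the fraction of reasoning steps whose `premise_grounding` is "Directly Grounded"; the O1 script may compute it differently. All function names are hypothetical.

```python
from collections import defaultdict


def grounded_fraction(records, key_fn):
    """Fraction of reasoning steps labelled 'Directly Grounded', per group.

    ASSUMPTION: this stands in for RQI; the toolkit's formula may differ.
    """
    grounded = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        for step in record.get("trace", []):
            if step.get("type") != "reasoning":
                continue
            annotation = step.get("annotation") or {}
            label = (annotation.get("attributes") or {}).get("premise_grounding")
            if label is None:
                continue
            key = key_fn(record, step)
            total[key] += 1
            grounded[key] += label == "Directly Grounded"
    return {key: grounded[key] / total[key] for key in total}


def by_model(record, step):
    """Group by the record-level model name (model-level view)."""
    return record.get("model")


def by_annotation_type(record, step):
    """Group by the step's annotation type (type-level view)."""
    return (step.get("annotation") or {}).get("type")
```

Calling `grounded_fraction(records, by_model)` gives one score per model, and `grounded_fraction(records, by_annotation_type)` gives one per annotation type.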

#### O2: Recovery Dynamics
- **Low-quality evidence episode detection**
- **Survival analysis** of recovery patterns
- **Action-based recovery analysis** (Refine, Follow-up, Repeat, Grounded reasoning)
- **Kaplan-Meier curves** for time-to-exit analysis
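
For readers unfamiliar with Kaplan-Meier curves, here is a minimal product-limit estimator in plain Python. It is a sketch, not the code in `O2_Recovery_dynamics.py`. It assumes each low-quality evidence episode is summarized as a duration in turns plus a flag saying whether the agent exited the episode (event) or the trace ended first (censored).

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: number of turns each episode lasted
    observed:  True if the episode ended in an exit, False if censored
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = removed = 0
        # Consume every episode that ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            events += bool(pairs[i][1])
            removed += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Both exits and censored episodes leave the risk set after t.
        at_risk -= removed
    return curve


if __name__ == "__main__":
    # Four hypothetical episodes; the third is censored at turn 2.
    for t, s in kaplan_meier([1, 2, 2, 3], [True, True, False, True]):
        print(f"turn {t}: P(still in episode) = {s:.2f}")
```

Here `S(t)` is the estimated probability that an episode is still ongoing after turn `t`, so a faster drop means quicker recovery.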

#### O3: Evidence Strength Analysis
- **Evidence level mapping** (E=0,1,2 based on sufficiency and clarity)
- **Correctness vs evidence strength** correlation
- **Calibration analysis** of agent confidence
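
The exact rule for assigning E is not spelled out in this README. The sketch below uses one plausible mapping, stated as an assumption in the docstring, and shows how accuracy could then be compared across levels. Check `O3_Evidence_strength_per_turn.py` for the actual definition; the function names here are hypothetical.

```python
from collections import defaultdict


def evidence_level(information_quality, information_clarity):
    """Map (quality, clarity) labels to an evidence level E in {0, 1, 2}.

    ASSUMED mapping, for illustration only:
      E=0  quality is not "Sufficient"
      E=1  "Sufficient" but clarity is not "Clear"
      E=2  "Sufficient" and "Clear"
    """
    if information_quality != "Sufficient":
        return 0
    return 2 if information_clarity == "Clear" else 1


def accuracy_by_level(samples):
    """samples: iterable of (evidence_level, is_correct) -> {level: accuracy}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for level, is_correct in samples:
        totals[level] += 1
        hits[level] += bool(is_correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}
```

A well-calibrated agent would be expected to show accuracy rising with E, since it answers mostly when the evidence supports an answer.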

## Configuration


### Annotation Configuration
- Modify `annotation/ontology.json` to adjust annotation categories
- Update `annotation/prompts.py` to customize LLM prompts
- Adjust concurrency settings based on API limits
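
As background on what a concurrency limit does, the pattern below caps the number of in-flight requests with an `asyncio.Semaphore`. It is a generic sketch, not the implementation in `annotation/main.py`; `annotate_all` and `fake_annotate` are hypothetical stand-ins.

```python
import asyncio


async def annotate_all(items, annotate_one, concurrency=10):
    """Run annotate_one over items with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await annotate_one(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_annotate(step):
    # Stand-in for an LLM call; hypothetical, not the toolkit's function.
    await asyncio.sleep(0.01)
    return {"step": step, "annotation": {"type": "StateAssessment"}}


if __name__ == "__main__":
    results = asyncio.run(annotate_all(range(5), fake_annotate, concurrency=2))
    print(len(results), "steps annotated")
```

If the API returns rate-limit errors, lowering the limit is usually the first thing to try.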

## Output Formats

### Evaluation Output
```json
{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "dataset": "hotpotqa",
  "question_id": "example_id",
  "question": "What is the question?",
  "ground_truth": "Expected answer",
  "trace": [
    {
      "type": "reasoning",
      "content": "I need to find information about...",
      "step_index": 0
    },
    {
      "type": "search",
      "query": "search query",
      "step_index": 1
    }
  ]
}
```

### Annotation Output
```json
{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "dataset": "hotpotqa",
  "question_id": "example_id",
  "question": "What is the question?",
  "ground_truth": "Expected answer",
  "trace": [
    {
      "type": "reasoning",
      "content": "I need to find information about...",
      "step_index": 0,
      "annotation": {
        "type": "StateAssessment",
        "justification": "Agent is assessing knowledge gap",
        "attributes": {
          "premise_grounding": "Directly Grounded"
        }
      },
      "trace_dependency": {
        "dependent_on": null
      }
    }
  ]
}
```
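
A quick sanity check on annotated output can catch incomplete runs before analysis. The sketch below assumes, based on the examples in this README, that `dependent_on` is either `null` or the `step_index` of an earlier step; `check_trace` is a hypothetical helper, not part of the toolkit.

```python
def check_trace(trace):
    """Return a list of human-readable problems found in one annotated trace."""
    problems = []
    seen = set()
    for step in trace:
        index = step.get("step_index")
        if "annotation" not in step:
            problems.append(f"step {index}: missing annotation")
        parent = (step.get("trace_dependency") or {}).get("dependent_on")
        # ASSUMPTION: a dependency must point at a step that appeared earlier.
        if parent is not None and parent not in seen:
            problems.append(f"step {index}: depends on unseen step {parent}")
        seen.add(index)
    return problems


if __name__ == "__main__":
    trace = [
        {"step_index": 0, "annotation": {"type": "StateAssessment"},
         "trace_dependency": {"dependent_on": None}},
        {"step_index": 1, "trace_dependency": {"dependent_on": 3}},
    ]
    for problem in check_trace(trace):
        print(problem)
```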
