# DEER: Deep Evaluation & Enhancement for Reports

This repository contains the official implementation of **DEER**, a framework for automated quality evaluation of Deep Research reports. This code is submitted as supplementary material.

## 1. Project Overview

DEER is designed to evaluate long-form reports generated by Large Language Models (LLMs) across multiple dimensions:
- **Report Quality**: Evaluating structural coherence, analytical soundness, and adherence to formatting standards.
- **Information Integrity**: Verifying the factual accuracy of claims against external sources.
- **Information Sufficiency**: Assessing the depth and breadth of the information provided.

The framework uses a multi-stage pipeline that combines web retrieval, claim verification, and LLM-based judging.

## 2. Directory Structure

```
DEER/
├── deepeval.yml                 # Conda environment configuration
├── run_report_evaluation.py        # Step 1: Report Quality evaluation script (LLM-based)
├── run_information_verification.py # Step 2: Fact-checking & Sufficiency evaluation script
├── run_score_integration.py        # Step 3: Score aggregation and final JSON generation
├── factchecker_v10/             # Core logic module
│   ├── eval_main.py             # Main entry point for evaluation
│   ├── claim_processor.py       # Claim extraction and processing
│   ├── source_judge.py          # Source credibility and relevance judgment
│   ├── web_downloader.py        # Web content retrieval
│   └── ...
├── script/                      # Shell scripts for batch processing (by domain)
├── data/                        # Input data directory (Reports to be evaluated)
└── output/                      # Evaluation results (JSON format)
```

## 3. Environment Setup

The project requires Python 3.10+. We recommend using Conda for environment management.

### Installation

1. Create the conda environment from the provided YAML file:
   ```bash
   conda env create -f deepeval.yml
   ```

2. Activate the environment:
   ```bash
   conda activate deepeval
   ```

3. Set up environment variables:
   - Create a `.env` file in the root directory.
   - Add your API keys (e.g., an OpenAI API key for the judge models):
        ```bash
        OPENAI_API_KEY=
        JINA_API_KEY=
        ANTHROPIC_API_KEY=
        OPENROUTER_API_KEY=
        ```
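Before launching a long run, it can help to confirm that the keys are actually populated. The following is a minimal sketch, not part of the repository: the helper names `parse_env_file` and `missing_keys` are hypothetical, and it assumes the `.env` file uses plain `KEY=VALUE` lines. Which keys a given run needs depends on the models you select, so treat a "missing" report as a hint rather than an error.

```python
import os
from pathlib import Path

# Key names come from the .env template above. Whether every key is required
# for a given run is an assumption; it depends on the models selected.
EXPECTED_KEYS = [
    "OPENAI_API_KEY",
    "JINA_API_KEY",
    "ANTHROPIC_API_KEY",
    "OPENROUTER_API_KEY",
]


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env_path=".env", expected=EXPECTED_KEYS):
    """Return expected keys that are empty in both the .env file and os.environ."""
    file_values = parse_env_file(env_path)
    return [k for k in expected if not (file_values.get(k) or os.environ.get(k))]


missing = missing_keys()
print("Missing keys:", ", ".join(missing) if missing else "none")
```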

## 4. Usage

The evaluation pipeline consists of three main steps. You can run them individually or use the provided shell scripts in the `script/` directory for automation.

### Step 1: Report Quality Evaluation
This step evaluates qualitative aspects such as **Analytical Soundness**, **Structural Coherence**, and **Request Fulfillment** using a judge LLM.

```bash
python run_report_evaluation.py \
  --root "data/math" \
  --prefix "gpt5_deep" \
  --eval_model "gpt-5.2" \
  --samples "1-10" \
  --output_dir "output/math"
```

### Step 2: Information Verification
This step evaluates the **Information Integrity** and **Information Sufficiency** of the reports. It extracts claims, verifies them against web search results, and checks citation validity.

```bash
python run_information_verification.py \
  --root "data/math" \
  --prefix "gpt5_deep" \
  --samples "1-10" \
  --eval_model "gpt-4.1-mini" \
  --output_root "output/math"
```
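- `--root`: Path to the directory containing sample folders (e.g., `data/math/1/`).
- `--prefix`: File prefix of the report to evaluate (e.g., `gpt5_deep` selects `gpt5_deep_1.md` for sample `1`).
- `--samples`: IDs of samples to evaluate, given as a comma-separated list of IDs and ranges (e.g., `1,2,5-10`).
- `--eval_model`: Model used for claim extraction and verification.
- `--output_root`: Directory where the verification results are written (e.g., `output/math`).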
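
To illustrate the `--samples` syntax, here is a minimal sketch of how a spec such as `1,2,5-10` could be expanded into individual sample IDs. The function name `parse_samples` is hypothetical, and the parser in the actual scripts may handle edge cases differently.

```python
def parse_samples(spec):
    """Expand a spec such as '1,2,5-10' into a sorted list of unique integer IDs."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            ids.update(range(start, end + 1))  # ranges are inclusive
        else:
            ids.add(int(part))
    return sorted(ids)


print(parse_samples("1,2,5-10"))  # [1, 2, 5, 6, 7, 8, 9, 10]
```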

### Step 3: Score Integration
This step aggregates the scores from Step 1 (Report Quality) and Step 2 (Information Verification) into a final JSON report for each sample and a summary for the entire batch.

```bash
python run_score_integration.py \
  --prefix "gpt5_deep" \
  --samples "1-10" \
  --output_dir "output/math"
```

### Batch Execution (Example)
See `script/run_math.sh` for a complete workflow example that processes multiple models/prefixes sequentially.

```bash
# Example: Run full pipeline for Math domain
bash script/run_math.sh
```
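
If you prefer to drive the three steps from Python rather than a shell script, the sketch below chains the documented commands for one or more prefixes. It is not a transcription of `script/run_math.sh`; the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes the scripts are invoked from the repository root with the flags shown in the steps above.

```python
import shlex
import subprocess
import sys


def build_commands(root, prefix, samples, output, judge_model, verify_model):
    """Return the three documented CLI invocations in pipeline order."""
    py = sys.executable
    return [
        # Step 1: Report Quality evaluation
        [py, "run_report_evaluation.py", "--root", root, "--prefix", prefix,
         "--eval_model", judge_model, "--samples", samples, "--output_dir", output],
        # Step 2: Information Verification (note: --output_root, not --output_dir)
        [py, "run_information_verification.py", "--root", root, "--prefix", prefix,
         "--samples", samples, "--eval_model", verify_model, "--output_root", output],
        # Step 3: Score Integration
        [py, "run_score_integration.py", "--prefix", prefix,
         "--samples", samples, "--output_dir", output],
    ]


def run_pipeline(prefixes, **kwargs):
    """Run all three steps for each prefix, stopping at the first failure."""
    for prefix in prefixes:
        for cmd in build_commands(prefix=prefix, **kwargs):
            subprocess.run(cmd, check=True)


# Dry run: print the commands instead of executing them.
for cmd in build_commands("data/math", "gpt5_deep", "1-10", "output/math",
                          judge_model="gpt-5.2", verify_model="gpt-4.1-mini"):
    print(shlex.join(cmd))
```

Calling `run_pipeline(["gpt5_deep"], root="data/math", samples="1-10", output="output/math", judge_model="gpt-5.2", verify_model="gpt-4.1-mini")` would execute the same commands; `check=True` makes a failing step raise instead of silently continuing to score integration.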

## 5. Input Data Format

The `data/` directory should follow this structure:
```
data/<domain>/
├── <sample_id>/
│   ├── query.md              # The user query/topic
│   ├── core_criteria.md      # Specific criteria for this query
│   └── <prefix>_<sample_id>.md   # The generated report to evaluate
```
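
A quick layout check before running the pipeline can catch missing files early. The sketch below tests each sample folder for the three files listed above; the function names `missing_input_files` and `check_samples` are hypothetical and not part of the repository.

```python
from pathlib import Path


def missing_input_files(root, prefix, sample_id):
    """Return the expected files that are absent from <root>/<sample_id>/."""
    sample_dir = Path(root) / str(sample_id)
    expected = ["query.md", "core_criteria.md", f"{prefix}_{sample_id}.md"]
    return [name for name in expected if not (sample_dir / name).is_file()]


def check_samples(root, prefix, sample_ids):
    """Map each incomplete sample ID to its list of missing files."""
    report = {}
    for sid in sample_ids:
        missing = missing_input_files(root, prefix, sid)
        if missing:
            report[sid] = missing
    return report
```

For example, `check_samples("data/math", "gpt5_deep", range(1, 11))` returns an empty dictionary when samples 1 through 10 are complete.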

## 6. Output Format

Results are saved in the `output/` directory:
- `output/<domain>/<prefix>/fact/`: JSON results from Information Verification (fact-checking and sufficiency).
- `output/<domain>/<prefix>/others/`: JSON results from Report Quality Evaluation.
- `output/<domain>/<prefix>/final/`: Integrated JSON results containing all metrics.
- `output/<domain>/<prefix>/final.json`: Summary of average scores for the batch.

Each final JSON contains:
- `score_avgs`: Aggregated scores for Integrity, Sufficiency, etc.
- `criteria_avgs`: Detailed breakdown of scores for specific criteria (e.g., claim accuracy, citation validity).
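
To inspect a result file, something like the following sketch may be enough. It assumes `score_avgs` and `criteria_avgs` are top-level keys that map metric names to numbers; the exact schema is not documented here, so non-numeric values are printed as-is. The helper names `load_summary` and `format_scores` are hypothetical.

```python
import json
from pathlib import Path


def load_summary(path):
    """Load a final JSON file and return its two documented sections."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumption: both keys sit at the top level of the file.
    return data.get("score_avgs", {}), data.get("criteria_avgs", {})


def format_scores(scores):
    """Render a name -> value mapping as aligned text lines, sorted by name."""
    lines = []
    for name, value in sorted(scores.items()):
        if isinstance(value, (int, float)):
            lines.append(f"{name:<40} {value:.3f}")
        else:
            lines.append(f"{name:<40} {value!r}")  # nested or non-numeric: show raw
    return lines
```

For a batch summary, you could call `load_summary("output/math/gpt5_deep/final.json")` and print the lines returned by `format_scores` for each section.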
